Abstract

Motivated by chemical applications, we revisit and extend a family of positive definite kernels
for graphs based on the detection of common subtrees, initially proposed by Ramon and Gärtner
(2003). We propose new kernels with a parameter to control the complexity of the subtrees used
as features to represent the graphs. This parameter makes it possible to interpolate smoothly between classical
graph kernels based on the count of common walks, on the one hand, and kernels that emphasize 
the detection of large common subtrees, on the other hand. We also propose two modular extensions 
to this formulation. The first extension increases the number of subtrees that define the feature 
space, and the second one removes noisy features from the graph representations. We validate 
experimentally these new kernels on binary classification tasks consisting in discriminating toxic 
and non-toxic molecules with support vector machines. 

1 Introduction 

There is an increasing need for algorithms to analyze and classify graph data, motivated in particular
by various applications in chemoinformatics and bioinformatics. A prominent example in chemoinformatics,
which motivates this work, is the generic problem of predicting various properties of small
molecules, such as toxicological effects, given their molecular graph, that is, the graph representing the
covalent bonds between atoms (Leach and Gillet, 2003). Classification of graphs is often associated with
the problem of graph mining, which consists in detecting interesting patterns occurring in the graphs and
using them as features to build predictive models (King et al., 1996; Inokuchi et al., 2003; Helma et al.,
2004; Deshpande et al., 2005). As an alternative to this approach, kernel methods associated with graph
kernels have recently emerged as a promising approach for classification of graph data. Kernel methods
such as support vector machines (SVM) operate implicitly in a possibly high-dimensional Hilbert space
of features, in the sense that no explicit computation of the image of the input data in the feature space
is required. Instead, only the inner product between the images of any two input data points, called the
kernel, is required (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). Applying kernel
methods to graph data therefore requires the definition of a kernel between graphs, hereafter simply
referred to as a graph kernel. Choosing a graph kernel implicitly amounts to defining a set of features to
represent the graphs and an inner product in the space of features.



Graph kernels were pioneered by Kashima et al. (2004) and Gärtner et al. (2003), who showed how
to map graphs to an infinite-dimensional feature space indexed by linear subgraphs, and compute an
inner product in that space. The resulting graph kernels compare two graphs through their common walks,
weighted by a function of their lengths (Gärtner et al., 2003) or by their probability under a random walk
model on the graphs (Kashima et al., 2004). While this representation might appear restrictive, these
kernels led to promising empirical results, often comparable to state-of-the-art approaches in the fields of
chemoinformatics (Mahé et al., 2005; Ralaivola et al., 2005) and bioinformatics (Borgwardt et al., 2005;
Karklin et al., 2005).



Nevertheless, Ramon and Gärtner (2003) highlighted the limited expressiveness of graph kernels
based on linear features, showing in particular that many different graphs can be mapped to the same
point in the corresponding feature space. Figure 1 illustrates this issue on a simple example. On the other
hand, they also showed that computing a perfect graph kernel, that is, a kernel mapping non-isomorphic
graphs to distinct points in the feature space, is NP-hard. This suggests that the expressiveness of
graph kernels must be traded for their computational complexity. As a first step towards a refinement
of the feature space used in walk-based graph kernels, Ramon and Gärtner (2003) introduced a kernel
function comparing graphs on the basis of their common subtrees. This representation looks promising
in particular in chemoinformatics, because physicochemical properties of atoms are known to be related
to their topological environment, which could be well captured by subtrees. However, the relationship
between the new subtree-based kernel and previous walk-based kernels was not analyzed in detail, and
the relevance of the new kernel was not tested empirically.

Our motivation in this paper is to study in detail, both theoretically and empirically, the relevance of
subtree features for graph kernels, and in particular to assess the benefits they bring compared to
walk-based graph kernels. For that purpose we first revisit the formulation introduced by Ramon and Gärtner
(2003) and propose two new kernels with an explicit description of their feature spaces and corresponding
inner products. We introduce a parameter in the formulations that makes it possible to gradually increase the
complexity of the subtrees used as features to represent the graphs, the notion of complexity depending
on the formulation. By decreasing the parameter we recover classical walk-based kernels, and by
increasing it, we can empirically observe in detail the effect of increasing the number and the complexity
of the tree features used to represent the graphs. Both formulations can be efficiently computed by
dynamic programming, in the spirit of the kernel proposed by Ramon and Gärtner (2003). When the size
of allowed subtrees is increased, however, we observe that the practical use of this kernel is limited by
the explosion in the number of subtrees occurring in the graphs. In a second step, we therefore introduce
two extensions to the initial formulation of the kernels that, on the one hand, extend and generalize
their associated feature space and, on the other hand, remove noisy features that correspond
to unwanted subtrees. The different kernels are compared experimentally on two binary classification
tasks consisting in discriminating toxic from non-toxic molecules with an SVM.

Although our main motivations are in chemical applications, we adopt the general framework of
graph kernels in this paper, because the kernels introduced may find different applications in domains
where data have a natural graph structure, such as bioinformatics, natural language processing or
image processing. We assume that the reader is familiar with kernel functions and SVMs, and refer
to Schölkopf and Smola (2002), Shawe-Taylor and Cristianini (2004) and references therein for
background on the subject. The remainder of the paper is organized as follows. Notations and definitions
related to graphs and trees are introduced in Section 2, followed in Section 3 by the definition of a
general class of kernels based on the detection of common subtrees. The next section (Section 4) revisits
the framework introduced in Ramon and Gärtner (2003), from which two particular graph kernels are
derived and further extended in Section 5. The kernels are validated experimentally in Section 6, and we
close with concluding remarks.





Figure 1: Two graphs having the same walk content (the walk patterns, drawn in the original figure, occur
with multiplicities ×5, ×4 and ×2), and consequently mapped to the same point of the feature space
corresponding to a kernel based on the count of walks (Gärtner et al., 2003).



2 Notations and Definitions 

In this section we introduce notations and general definitions related to graphs and trees. 

2.1 Labeled Directed Graphs 

A labeled graph $G = (\mathcal{V}_G, \mathcal{E}_G)$ is defined by a finite set of vertices $\mathcal{V}_G$, a set of edges
$\mathcal{E}_G \subset \mathcal{V}_G \times \mathcal{V}_G$, and a labeling function $l : \mathcal{V}_G \cup \mathcal{E}_G \to \mathcal{A}$ which assigns a label $l(x)$
taken from an alphabet $\mathcal{A}$ to any vertex or edge $x$. We let $|\mathcal{V}_G|$ be the number of vertices of $G$,
$|\mathcal{E}_G|$ be its number of edges, and we assume below that a set of labels $\mathcal{A}$ common to all graphs has
been fixed. In directed graphs, edges are oriented and to each vertex $u \in \mathcal{V}_G$ corresponds a set of
incoming neighbors $\delta^-(u) = \{v \in \mathcal{V}_G : (v, u) \in \mathcal{E}_G\}$ and outgoing neighbors
$\delta^+(u) = \{v \in \mathcal{V}_G : (u, v) \in \mathcal{E}_G\}$. We let $d^-(u) = |\delta^-(u)|$ be the in-degree of the vertex $u$, and
$d^+(u) = |\delta^+(u)|$ be its out-degree. A walk of length $n$ in the graph $G = (\mathcal{V}_G, \mathcal{E}_G)$ is a succession
of $n + 1$ vertices $(v_0, \ldots, v_n) \in \mathcal{V}_G^{n+1}$ such that $(v_i, v_{i+1}) \in \mathcal{E}_G$ for $i = 0, \ldots, n - 1$.
A path is a walk $(v_0, \ldots, v_n)$ with the additional condition that $i \neq j \iff v_i \neq v_j$. Finally, a graph
is said to be connected if there is a walk between any pair of vertices when the orientation of edges is
dropped.

For applications in chemistry considered below, we associate a labeled directed graph $G = (\mathcal{V}_G, \mathcal{E}_G)$
to the planar structure of a molecule. To do so, we let the set of vertices $\mathcal{V}_G$ correspond to the set of atoms
of the molecule, the set of edges $\mathcal{E}_G$ to its covalent bonds, and label these graph elements according to
an alphabet $\mathcal{A}$ consisting of the different types of atoms and bonds. Note that since graphs are directed, a
pair of edges of opposite direction is introduced for each covalent bond of the molecule. Figure 2 shows
a chemical compound seen as a labeled directed graph.
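As an illustration of this construction, the following minimal Python sketch builds the labeled directed graph of a small molecule. The function name `molecule_to_digraph`, the dictionary-based representation and the toy formaldehyde-like input are assumptions made for this example only; they are not taken from the paper.

```python
from collections import defaultdict

def molecule_to_digraph(atoms, bonds):
    """Build a labeled directed graph from a molecule description.

    atoms: dict mapping a vertex id to its atom label.
    bonds: list of (u, v, bond_label) triples, one per covalent bond.
    Each bond yields two directed edges of opposite direction.
    """
    vertex_label = dict(atoms)
    edge_label = {}
    out_nb = defaultdict(set)   # delta^+(u): outgoing neighbors
    in_nb = defaultdict(set)    # delta^-(u): incoming neighbors
    for u, v, bond in bonds:
        for a, b in ((u, v), (v, u)):
            edge_label[(a, b)] = bond
            out_nb[a].add(b)
            in_nb[b].add(a)
    return vertex_label, edge_label, out_nb, in_nb

# Hypothetical toy input: one carbon bound to an oxygen and two hydrogens.
atoms = {0: "C", 1: "O", 2: "H", 3: "H"}
bonds = [(0, 1, "double"), (0, 2, "single"), (0, 3, "single")]
vertex_label, edge_label, out_nb, in_nb = molecule_to_digraph(atoms, bonds)
print(len(edge_label))      # 6 directed edges for 3 bonds
print(sorted(out_nb[0]))    # [1, 2, 3]
```

Because every bond is duplicated, the in-degree and out-degree of each vertex coincide in such molecular graphs.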

2.2 Trees 

A tree t is a directed connected acyclic graph in which all vertices have in-degree one, except one that 
has in-degree zero. The node with in-degree zero is known as the root r(t) of the tree. Nodes with 
out-degree zero are known as leaf nodes, others are called internal nodes. Trees are naturally oriented, 
edges being directed from the root to the leaves. The outgoing neighbors of an internal node are known
as its children, and the unique incoming neighbor of a node (apart from the root) is known as its parent.
If two nodes have the same parent, they are said to be siblings. The size $|t|$ of the tree $t$ is its number
of nodes: $|t| = |\mathcal{V}_t|$. The depth of a node corresponds to the number of edges connecting it to the root
plus one (see footnote 1), and the depth of the tree is the maximum depth of its nodes. Finally, we introduce
a couple of definitions that will be useful in the following.

Figure 2: A chemical compound seen as a labeled graph

Definition 1 (Balanced tree). A perfectly depth-balanced tree of order h is a tree where the depth of 
each leaf node is h. Perfectly depth-balanced trees are also called balanced trees below. 

Definition 2 (Branching cardinality). We define the branching cardinality of the tree $t$, noted branch($t$),
as its number of leaf nodes minus one. More formally, for the tree $t = (\mathcal{V}_t, \mathcal{E}_t)$ with
$\mathcal{V}_t = (v_1, \ldots, v_{|t|})$, branch($t$) is given by:

$$\mathrm{branch}(t) = \sum_{i=1}^{|t|} \mathbb{1}(d^+(v_i) = 0) - 1,$$

where $\mathbb{1}(\cdot)$ is a binary function equal to one if its argument is true, and zero otherwise.

This terminology stems from the observation that this quantity also corresponds to the sum, over the 
non-leaf nodes of the tree, of their numbers of children minus one. It therefore measures how many extra 
branchings there are compared to a linear tree, which has branching cardinality 0. These definitions are 
illustrated in Figure 3.
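The two readings of the branching cardinality (number of leaves minus one, or sum over non-leaf nodes of their number of children minus one) can be checked numerically. The sketch below assumes a tree stored as a dictionary mapping each node to the list of its children; the function names and the example trees are illustrative choices, not part of the paper.

```python
def branch(children):
    """Branching cardinality: number of leaf nodes minus one."""
    leaves = sum(1 for kids in children.values() if not kids)
    return leaves - 1

def branch_via_internal(children):
    """Same quantity as a sum over non-leaf nodes of (number of children - 1)."""
    return sum(len(kids) - 1 for kids in children.values() if kids)

def depth(children, root):
    """Depth of the tree, with the convention that the root has depth one."""
    return 1 + max((depth(children, c) for c in children[root]), default=0)

# A balanced tree of order 3 with 8 nodes: every leaf sits at depth 3.
balanced = {0: [1, 2], 1: [3, 4, 5], 2: [6, 7], 3: [], 4: [], 5: [], 6: [], 7: []}
# A linear tree of depth 3.
chain = {0: [1], 1: [2], 2: []}

print(len(balanced), depth(balanced, 0), branch(balanced))  # 8 3 4
print(branch_via_internal(balanced))                        # 4
print(branch(chain), branch_via_internal(chain))            # 0 0
```

The first example has the same size, order and branching cardinality as the balanced tree described in the caption of Figure 3, although its exact shape is only one possible choice.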



Figure 3: Left: a tree $t_1$ of depth 5 with $|t_1| = 9$ and branch($t_1$) = 3. Right: a balanced tree $t_2$ of order
3 with $|t_2| = 8$ and branch($t_2$) = 4. Top nodes are root nodes, bottom nodes are leaf nodes.




The remainder of the paper introduces kernel functions between labeled directed graphs based on
the detection in the graphs of patterns corresponding to labeled trees. To lighten notations, we simply 
refer below to labeled directed graphs and labeled trees as graphs and trees. 

1 Note that the depth of the root node is one.

3 The Tree-Pattern Graph Kernel



This section introduces a general class of graph kernels based on the detection, in the graphs, of patterns
corresponding to particular tree structures. We start by defining precisely this notion of tree-pattern. 

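Definition 3 (Tree-pattern). Let a graph $G = (\mathcal{V}_G, \mathcal{E}_G)$ and a tree $t = (\mathcal{V}_t, \mathcal{E}_t)$, with
$\mathcal{V}_t = (n_1, \ldots, n_{|t|})$. A $|t|$-uple of vertices $(v_1, \ldots, v_{|t|}) \in \mathcal{V}_G^{|t|}$ is a tree-pattern of $G$ with
respect to $t$, which we denote by $(v_1, \ldots, v_{|t|}) = \mathrm{pattern}(t)$, if and only if the following holds:

$$\begin{cases}
\forall i \in [1, |t|],\ l(v_i) = l(n_i), \\
\forall (n_i, n_j) \in \mathcal{E}_t,\ (v_i, v_j) \in \mathcal{E}_G \wedge l((v_i, v_j)) = l((n_i, n_j)), \\
\forall (n_i, n_j), (n_i, n_k) \in \mathcal{E}_t,\ j \neq k \iff v_j \neq v_k.
\end{cases}$$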

In other words a tree-pattern is a combination of graph vertices that can be arranged in a particular 
tree structure, according to the labels and the connectivity properties of the graph. Note from this 
definition that vertices of the graph are allowed to appear several times in a tree-pattern, under the 
condition that sibling nodes of the corresponding tree are associated with distinct vertices of the graph.
We now introduce a functional to count occurrences of these patterns. 

Definition 4 (Tree-pattern counting function). A tree-pattern counting function returning the number
of times a tree-pattern occurs in a graph is defined for the tree $t$ and the graph $G = (\mathcal{V}_G, \mathcal{E}_G)$,
$\mathcal{V}_G = (v_1, \ldots, v_{|\mathcal{V}_G|})$, as

$$\psi_t(G) = \left|\{(\alpha_1, \ldots, \alpha_{|t|}) \in [1, |\mathcal{V}_G|]^{|t|} : (v_{\alpha_1}, \ldots, v_{\alpha_{|t|}}) = \mathrm{pattern}(t)\}\right|.$$

A restriction of $\psi_t$ to patterns rooted in a specified vertex $v$ is given by

$$\psi_t^{(v)}(G) = \left|\{(\alpha_1, \ldots, \alpha_{|t|}) \in [1, |\mathcal{V}_G|]^{|t|} : (v_{\alpha_1}, \ldots, v_{\alpha_{|t|}}) = \mathrm{pattern}(t) \wedge v_{\alpha_1} = v\}\right|.$$
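To make Definitions 3 and 4 concrete, the following brute-force Python sketch enumerates all $|t|$-uples of graph vertices and keeps those satisfying the three conditions of Definition 3. It is exponential in the tree size and only meant for toy inputs. The data layout (node 0 of the tree taken as the root, edges stored as `(parent, child, label)` triples) and all names are assumptions of this example.

```python
from itertools import product

def count_tree_patterns(tree_labels, tree_edges, v_label, e_label, root_vertex=None):
    """Brute-force count of tree-patterns of a graph with respect to a tree.

    tree_labels[i]: label of tree node i (node 0 is assumed to be the root).
    tree_edges: list of (parent, child, edge_label) triples.
    v_label: dict graph vertex -> label; e_label: dict (u, v) -> label.
    root_vertex: if given, only patterns rooted in that vertex are counted.
    """
    m = len(tree_labels)
    children = {}
    for parent, child, _ in tree_edges:
        children.setdefault(parent, []).append(child)
    count = 0
    for tup in product(list(v_label), repeat=m):
        if root_vertex is not None and tup[0] != root_vertex:
            continue
        # Condition 1: vertex labels agree with tree node labels.
        if any(v_label[tup[i]] != tree_labels[i] for i in range(m)):
            continue
        # Condition 2: each tree edge maps to a graph edge with the same label.
        if any((tup[i], tup[j]) not in e_label or e_label[(tup[i], tup[j])] != lab
               for i, j, lab in tree_edges):
            continue
        # Condition 3: siblings are mapped to distinct graph vertices.
        if any(len({tup[j] for j in kids}) < len(kids) for kids in children.values()):
            continue
        count += 1
    return count

# Toy graph C-C-O, with one pair of opposite edges per bond.
v_label = {0: "C", 1: "C", 2: "O"}
e_label = {(0, 1): "s", (1, 0): "s", (1, 2): "s", (2, 1): "s"}
chain = (["C", "C", "C"], [(0, 1, "s"), (1, 2, "s")])
fork = (["C", "C", "O"], [(0, 1, "s"), (0, 2, "s")])
print(count_tree_patterns(*chain, v_label, e_label))                 # 2
print(count_tree_patterns(*fork, v_label, e_label))                  # 1
print(count_tree_patterns(*chain, v_label, e_label, root_vertex=0))  # 1
```

The two patterns found for the chain C-C-C are (0, 1, 0) and (1, 0, 1): the same vertex appears twice, which Definition 3 allows because only siblings must be distinct.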

With this new definition at hand we can define a general graph kernel based on the detection of 
common tree-patterns in the graphs. 

Definition 5 (Tree-pattern graph kernel). The tree-pattern graph kernel $K$ is given for the graphs $G_1$
and $G_2$ by

$$K(G_1, G_2) = \sum_{t \in \mathcal{T}} w(t)\, \psi_t(G_1)\, \psi_t(G_2),$$

where $\mathcal{T}$ is a set of trees, $w : \mathcal{T} \to \mathbb{R}$ is a tree weighting functional and $\psi_t$ is the tree-pattern counting
function of Definition 4.

The kernel of Definition 5 is positive definite as soon as the weights $w(t)$ are non-negative, since it can
then be written as a standard dot-product $K(G_1, G_2) = \langle \phi(G_1), \phi(G_2) \rangle$, where $\phi(G)$ is the mapping
that maps any graph $G$ to the feature space indexed by the trees of the set $\mathcal{T}$ as
$\phi(G) = \left(\sqrt{w(t)}\, \psi_t(G)\right)_{t \in \mathcal{T}}$. Figure 4 illustrates this mapping.

4 Examples of tree-pattern graph kernels 

In a recent work, Ramon and Gärtner (2003) proposed a particular tree-pattern graph kernel fitting the
general Definition 5. In this section, we propose two different kernels with explicit feature spaces and
inner products, discuss their practical computation, and highlight their differences with the kernel of
Ramon and Gärtner (2003).

Figure 4: A molecular compound $G$ (left) and its feature space representation $\phi(G)$ (right). Note that the
red and green trees are balanced. Note moreover that the green tree consists of a set of linearly connected
atoms, which is known as a molecular fragment in chemoinformatics. Note finally that the same C atom
appears in the 3rd and 5th positions in the tree-pattern corresponding to the green tree.



4.1 Kernels Definition 

According to Definition 5, two key elements enter in the definition of a tree-pattern graph kernel. First,
the set of trees $\mathcal{T}$ indexing the feature space the graphs are mapped to must be chosen. The kernels we
consider in this section are based on the same feature space: the space indexed by the set of balanced
trees of order $h$ introduced in Definition 1, labeled according to the graph labeling alphabet $\mathcal{A}$. We will
refer to this set as $\mathcal{B}_h$ in the following. Second, the tree weighting function $w$ must be defined. A natural
way to define such a functional is to take into account the structure of the trees, and accordingly, we
propose to relate the weight of a tree to its size or its branching cardinality. In particular we propose to
consider the following kernels:

Definition 6 (Size-based balanced tree-pattern kernel). For the pair of graphs $G_1$ and $G_2$, the
size-based balanced tree-pattern kernel of order $h$ is defined as

$$K^h_{\mathrm{Size}}(G_1, G_2) = \sum_{t \in \mathcal{B}_h} \lambda^{|t| - h}\, \psi_t(G_1)\, \psi_t(G_2). \qquad (1)$$

Definition 7 (Branching-based balanced tree-pattern kernel). For the pair of graphs $G_1$ and $G_2$, the
branching-based balanced tree-pattern kernel of order $h$ is defined as

$$K^h_{\mathrm{Branch}}(G_1, G_2) = \sum_{t \in \mathcal{B}_h} \lambda^{\mathrm{branch}(t)}\, \psi_t(G_1)\, \psi_t(G_2). \qquad (2)$$

Note that the depth of a tree is a lower bound on its size, attained for a tree consisting of a linear
chain of vertices. For such a tree, at depth $h$, we have $|t| - h = \mathrm{branch}(t) = 0$, and we see that the
corresponding tree-patterns are given a unit weight in the kernels of Definitions 6 and 7. The complexity of
a tree naturally increases with its size and branching cardinality, and the $\lambda$ parameter entering the kernel
Definitions 6 and 7 has the effect of favoring tree-patterns depending on their degree of complexity. A
value of $\lambda$ greater than one favors the influence of tree-patterns of increasing complexity over the trivial
linear tree-patterns, while they are penalized by a value of $\lambda$ smaller than one. We can note, however,
that while the size of a tree increases with its branching cardinality, the converse is not true. For any tree
$t$ of depth $h$, we therefore always have $|t| - h \geq \mathrm{branch}(t)$, and the tree weighting is more pronounced in
the size-based than in the branching-based kernel. In the case of balanced trees, this difference is
particularly marked when the nodes with large out-degree are close to the root node. This is due to the fact
that every leaf must be at depth $h$, and while the size of the tree necessarily increases by at least $h - 1$
along each path starting from the root, the branching cardinality does not (see footnote 2). The main
difference in the feature space representations of the graphs is therefore induced by this particular type of
tree-patterns, which can be interpreted as collections of regular subtree patterns merged in the root node.
This suggests for instance that, for $\lambda < 1$, the branching-based formulation of the kernel may to some
extent tolerate large, yet regular patterns, that would be strongly penalized in the size-based formulation.
Figure 5 illustrates these tree weightings based on the size and branching cardinality.




Figure 5: A set of balanced trees of order 3, together with their size-based (left) and branching-based
(right) $\lambda$ weighting.



When $\lambda$ tends to zero, the complexity of the patterns is so penalized that only tree-patterns consisting
of linear chains of graph vertices have non-vanishing weights, and the kernels of Definitions 6 and 7 boil
down to a kernel based on the detection of common walks (Gärtner et al., 2003). More formally, if we
define the set of walks of length $n$ of the graph $G$ as

$$\mathcal{W}_n(G) = \{(v_0, \ldots, v_n) \in \mathcal{V}_G^{n+1} : (v_i, v_{i+1}) \in \mathcal{E}_G,\ 0 \leq i \leq n - 1\},$$

and define for the graphs $G_1$ and $G_2$ the following walk-count kernel:

$$K^n_{\mathrm{Walk}}(G_1, G_2) = \sum_{w_1 \in \mathcal{W}_n(G_1)} \sum_{w_2 \in \mathcal{W}_n(G_2)} \mathbb{1}(l(w_1) = l(w_2)), \qquad (3)$$

where $\mathbb{1}(l(w_1) = l(w_2))$ is one if all pairs of corresponding edges and vertices are identically labeled in
the walks $w_1$ and $w_2$, and zero otherwise, one easily gets that:

$$\lim_{\lambda \to 0} K^h_{\mathrm{Size}}(G_1, G_2) = \lim_{\lambda \to 0} K^h_{\mathrm{Branch}}(G_1, G_2) = K^{h-1}_{\mathrm{Walk}}(G_1, G_2).$$

Increasing the value of $\lambda$ relaxes the penalization on complex subtree features, and can therefore be
interpreted as introducing tree-patterns of increasing complexity in the walk-based kernel of Equation 3.
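The walk-count kernel of Equation 3 does not require enumerating walks explicitly. A minimal sketch, assuming graphs stored as a pair `(v_label, e_label)` of dictionaries (an assumption of this example, as are the function names), counts pairs of identically labeled walks by a simple recursion on the walk length.

```python
def out_neighbors(e_label):
    """Map each vertex to the list of its outgoing neighbors."""
    nb = {}
    for (u, v) in e_label:
        nb.setdefault(u, []).append(v)
    return nb

def walk_kernel(g1, g2, n):
    """Count pairs of identically labeled walks of length n in g1 and g2."""
    (vl1, el1), (vl2, el2) = g1, g2
    nb1, nb2 = out_neighbors(el1), out_neighbors(el2)
    # k[(u, v)]: number of matching walk pairs of the current length
    # starting in u (in g1) and v (in g2); length 0 compares labels only.
    k = {(u, v): int(vl1[u] == vl2[v]) for u in vl1 for v in vl2}
    for _ in range(n):
        new = {}
        for u in vl1:
            for v in vl2:
                if vl1[u] != vl2[v]:
                    new[(u, v)] = 0
                    continue
                new[(u, v)] = sum(
                    k[(a, b)]
                    for a in nb1.get(u, [])
                    for b in nb2.get(v, [])
                    if el1[(u, a)] == el2[(v, b)]
                )
        k = new
    return sum(k.values())

# Toy graph C-C-O, with one pair of opposite edges per bond.
G = ({0: "C", 1: "C", 2: "O"},
     {(0, 1): "s", (1, 0): "s", (1, 2): "s", (2, 1): "s"})
print([walk_kernel(G, G, n) for n in range(3)])  # [5, 6, 8]
```

For instance, the graph has four walks of length 1 with label sequences CC, CC, CO and OC, which gives 4 + 1 + 1 = 6 matching pairs when the graph is compared with itself.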



It should be noted finally that the parameters $h$ and $\lambda$ are directly related to the nature of the features
representing the graphs and to their relative importance. Optimal values of the parameters are therefore 
likely to be dependent on the problem and data considered, and can hardly be chosen a priori. As an 
example, because of the variety of chemical compounds, the graphs considered in a chemical application 
can have a great structural diversity. This suggests that these parameters should be estimated from the 
data using, for example, cross-validation techniques. 

2 At the extreme, we have $|t| = 1 + (h - 1) \times d^+(r(t))$ vs. $\mathrm{branch}(t) = d^+(r(t)) - 1$.

4.2 Kernels Computation

We now propose two factorization schemes to compute the kernels of Definitions 6 and 7. These
factorizations are inspired by the dynamic programming (DP) algorithm proposed by Ramon and Gärtner
(2003) to compute a slightly different graph kernel, discussed in the next subsection. The factorization
relies on the following definition:

Definition 8 (Neighborhood matching set). The neighborhood matching set $\mathcal{M}(u, v)$ of two graph
vertices $u$ and $v$ is defined as

$$\mathcal{M}(u, v) = \{R \subseteq \delta^+(u) \times \delta^+(v) \mid (\forall (a, b), (c, d) \in R : a \neq c \wedge b \neq d)
\wedge (\forall (a, b) \in R : l(a) = l(b) \wedge l((u, a)) = l((v, b)))\}.$$

Each $R \in \mathcal{M}(u, v)$ consists of one or several pair(s) of neighbors of $u$ and $v$ that are identically
labeled and connected to $u$ and $v$ by edges of the same label. It follows from Definition 3 that such
an element $R$ corresponds to a pair of balanced tree-patterns of order 2 rooted in $u$ and $v$, found in the
graph(s) $u$ and $v$ belong to. Moreover, provided $u$ and $v$ have the same label, these patterns correspond
to the same balanced tree. We can state the following propositions, whose proofs are postponed to
Appendix A.

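Proposition 1 (Size-based kernel computation). The order $h$ size-based tree-pattern kernel $K^h_{\mathrm{Size}}$ of
Definition 6 between two graphs $G_1$ and $G_2$ can be computed as:

$$K^h_{\mathrm{Size}}(G_1, G_2) = \lambda^{-h} \sum_{u \in \mathcal{V}_{G_1}} \sum_{v \in \mathcal{V}_{G_2}} k_h(u, v), \qquad (4)$$

where $k_n$, $n = 1, \ldots, h$ is defined recursively by

$$\begin{cases}
k_1(u, v) = \lambda\, \mathbb{1}(l(u) = l(v)), \\
k_n(u, v) = \lambda\, \mathbb{1}(l(u) = l(v)) \displaystyle\sum_{R \in \mathcal{M}(u, v)} \prod_{(u', v') \in R} k_{n-1}(u', v'), \quad n = 2, \ldots, h.
\end{cases}$$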

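Proposition 2 (Branching-based kernel computation). The order $h$ branching-based tree-pattern kernel
$K^h_{\mathrm{Branch}}$ of Definition 7 between two graphs $G_1$ and $G_2$ can be computed as:

$$K^h_{\mathrm{Branch}}(G_1, G_2) = \sum_{u \in \mathcal{V}_{G_1}} \sum_{v \in \mathcal{V}_{G_2}} k_h(u, v), \qquad (5)$$

where $k_n$, $n = 1, \ldots, h$ is defined recursively by

$$\begin{cases}
k_1(u, v) = \mathbb{1}(l(u) = l(v)), \\
k_n(u, v) = \mathbb{1}(l(u) = l(v)) \displaystyle\sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \lambda\, k_{n-1}(u', v'), \quad n = 2, \ldots, h.
\end{cases}$$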
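The two recursions can be sketched in a few lines of Python. The code below is an illustrative, unoptimized reading of Propositions 1 and 2 under stated assumptions: graphs are stored as `(v_label, e_label)` dictionaries, only non-empty matchings $R$ are enumerated (as in the description following Definition 8), each $R$ is visited once as an unordered set, and the factor $\frac{1}{\lambda} \prod \lambda$ of the branching-based recursion is written as $\lambda^{|R| - 1}$. All function and parameter names are hypothetical.

```python
from itertools import combinations, permutations
from math import prod

def tree_pattern_kernel(g1, g2, h, lam, weighting="branch"):
    """Sketch of the DP recursions; weighting is 'branch' or 'size'."""
    (vl1, el1), (vl2, el2) = g1, g2
    nb1, nb2 = {}, {}
    for (a, b) in el1:
        nb1.setdefault(a, []).append(b)
    for (a, b) in el2:
        nb2.setdefault(a, []).append(b)

    def matchings(u, v):
        """Yield each non-empty neighborhood matching R of (u, v) once."""
        nu, nv = nb1.get(u, []), nb2.get(v, [])
        for n_pairs in range(1, min(len(nu), len(nv)) + 1):
            for left in combinations(nu, n_pairs):
                for right in permutations(nv, n_pairs):
                    if all(vl1[a] == vl2[b] and el1[(u, a)] == el2[(v, b)]
                           for a, b in zip(left, right)):
                        yield list(zip(left, right))

    size = weighting == "size"
    node_factor = lam if size else 1.0   # the size-based form pays lam per node
    k = {(u, v): node_factor * float(vl1[u] == vl2[v]) for u in vl1 for v in vl2}
    for _ in range(2, h + 1):
        new = {}
        for u in vl1:
            for v in vl2:
                total = 0.0
                if vl1[u] == vl2[v]:
                    for R in matchings(u, v):
                        term = prod(k[p] for p in R)
                        total += term if size else lam ** (len(R) - 1) * term
                new[(u, v)] = node_factor * total
        k = new
    result = sum(k.values())
    return result / lam ** h if size else result

# Toy graph C-C-O, with one pair of opposite edges per bond.
G = ({0: "C", 1: "C", 2: "O"},
     {(0, 1): "s", (1, 0): "s", (1, 2): "s", (2, 1): "s"})
print(tree_pattern_kernel(G, G, 2, 0.5, "branch"))  # 6.5
print(tree_pattern_kernel(G, G, 2, 0.5, "size"))    # 6.5
print(tree_pattern_kernel(G, G, 3, 0.0, "branch"))  # 8.0
```

At order 2 both weightings give $6 + \lambda$ on this toy graph, since the only non-linear pattern has one extra node and one extra branch. Setting $\lambda = 0$ in the branching-based form keeps only matchings of size one, so the last call counts pairs of common walks of length 2.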

Not surprisingly, Propositions 1 and 2 show that the kernels $K^h_{\mathrm{Size}}$ and $K^h_{\mathrm{Branch}}$ of Definitions 6 and 7
have the same complexity. More precisely, for the pair of graphs $G_1$ and $G_2$, it follows from (4) and (5)
that this complexity is equal to the product of the sizes of $G_1$ and $G_2$, times the complexity of evaluating
the functional $k_h$. In both cases, for the pair of graph vertices $u$ and $v$, evaluating $k_h(u, v)$ amounts to
summing, over all possible matchings of neighbors $R \in \mathcal{M}(u, v)$, a quantity expressed as a product of
$|R|$ functionals $k_{h-1}$. The size of $\mathcal{M}(u, v)$, $|\mathcal{M}(u, v)|$, is maximal if all the neighbors of $u$ and $v$, as
well as the edges that connect them to $u$ and $v$, are identically labeled. In that case we have

$$|\mathcal{M}(u, v)| = \sum_{k=1}^{\min(d^+(u), d^+(v))} A^k_{d^+(u)}\, A^k_{d^+(v)},$$

where $k$ ranges over the cardinality $|R|$ of the set of matching neighbors and $A^k_n = n!/(n-k)!$ is the
number of arrangements of $k$ elements among $n$. If we let $d$ be an upper bound
on the out-degree of the vertices of the graphs considered, it follows that $|\mathcal{M}(u, v)| \leq \sum_{k=1}^{d} (A^k_d)^2$ and
we can derive the following worst-case complexity

$$O(K^h_{\mathrm{Size}}(G_1, G_2)) = O(K^h_{\mathrm{Branch}}(G_1, G_2)) = |\mathcal{V}_{G_1}| \times |\mathcal{V}_{G_2}| \times \left(\sum_{k=1}^{d} k\, (A^k_d)^2\right)^{h-1}.$$

In the case of chemical compounds, we have $d = 4$. The factor $\sum_{k=1}^{d} k\, (A^k_d)^2$ equals 4336, and the
complexity looks prohibitive. However this is only a worst-case complexity which is strongly reduced
in practice because (i) the out-degree of the vertices is often smaller than 4 (see footnote 3), and (ii) the
size of $\mathcal{M}(u, v)$ is reduced by the fact that vertices and edges can have distinct labels.
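The value 4336 quoted for $d = 4$ can be reproduced with a short arithmetic check, reading $A^k_d$ as the number of $k$-arrangements $d!/(d-k)!$. The function name below is an illustrative choice.

```python
from math import perm

def worst_case_factor(d):
    """Sum over k = 1..d of k * (A_d^k)^2, with A_d^k = d! / (d - k)!."""
    return sum(k * perm(d, k) ** 2 for k in range(1, d + 1))

# For d = 4: 1*4^2 + 2*12^2 + 3*24^2 + 4*24^2 = 16 + 288 + 1728 + 2304.
print(worst_case_factor(4))  # 4336
```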



4.3 Relation to previous work 



At this point, it is worth recalling the kernel formulation introduced by Ramon and Gärtner (2003) in
order to highlight the differences with the kernels proposed in Definitions 6 and 7. In the context of
graphs with labeled vertices and edges (see footnote 4), at order $h$, the kernel introduced in Ramon and
Gärtner (2003), that we denote by $K^h_{\mathrm{Ramon}}$, is formulated as follows:

$$K^h_{\mathrm{Ramon}}(G_1, G_2) = \sum_{u \in \mathcal{V}_{G_1}} \sum_{v \in \mathcal{V}_{G_2}} k_h(u, v),$$

where $k_n$ is defined by

$$\begin{cases}
k_1(u, v) = \mathbb{1}(l(u) = l(v)), \\
k_n(u, v) = \mathbb{1}(l(u) = l(v))\, \lambda_u \lambda_v \displaystyle\sum_{R \in \mathcal{M}(u, v)} \prod_{(u', v') \in R} k_{n-1}(u', v'), \quad n = 2, \ldots, h.
\end{cases}$$

It is clear that this kernel and the kernels of Definitions 6 and 7 have the same feature space. The main
difference lies in the fact that in this formulation, a parameter $\lambda_v$ is introduced for each vertex $v$ of each
graph. It can be checked that under this parametrization, each tree-pattern is weighted by the product of
the parameters $\lambda_v$ associated with its internal nodes. In the special case where these parameters are taken
equal to a single parameter $\lambda$, each pattern is therefore weighted by $\lambda$ raised to the power of its number
of internal nodes. While this bears some similarity with the size-based weighting proposed in the kernel
of Definition 6, we note for instance that the three leftmost trees of Figure 5 are identically weighted,
namely by a factor $\lambda^2$. Moreover, the convergence to the walk-based kernel of Equation 3 observed
when $\lambda$ tends to zero for the kernels of Definitions 6 and 7 does not hold with this formulation.

3 For example, in the two datasets considered in our experiments in Section 6, the average out-degree of the vertices is nearly
2 (2.14 for the first dataset, and 2.06 for the second one). 

4 The original formulation considered graphs with labeled vertices only, and the definition of the neighborhood matching 
set is refined in this paper in order to handle labeled edges. 






5 Extensions 



The kernels introduced in the previous section arise directly from the adaptation of the algorithm
proposed in Ramon and Gärtner (2003). In this section we introduce two extensions to this initial
formulation. First, we extend the branching-based kernel of Definition 7 to a feature space indexed by a larger,
and more general, set of trees. Second, we propose to eliminate a set of noisy tree-patterns from the
feature space.

5.1 Considering all trees 

The DP algorithms of Section 4.2 recursively extend the tree-patterns under construction until they reach
a specified depth. Because they are based on the notion of neighborhood matching sets introduced in
Definition 8, these algorithms add at least one child to every leaf node of the patterns under extension
at each step of the recursive process. When they reach the specified depth, the patterns are therefore
balanced, and the choice of the feature space associated with the kernels of Definitions 6 and 7 was actually
dictated by their computation.

Rather than focusing on features of a particular size, standard representations of molecules
involve structural features of different sizes. A prominent example is that of molecular fingerprints
(Ralaivola et al., 2005) that typically represent a molecule by its exhaustive list of fragments of length
up to 8, where a fragment is defined as a linear succession of connected atoms (see Figure 4). In this
section, we note that a slight modification of the DP algorithm of Proposition 2 generalizes the kernel of
Definition 7 to a feature space indexed by the set of general trees up to a given depth, instead of the set
of balanced trees of the corresponding order. More precisely, if we let $\mathcal{T}_h$ be the set of trees of depth up
to $h$, and if we define the until-N extension of the branching-based kernel of Definition 7 as

$$K^{h,\text{until-N}}_{\mathrm{Branch}}(G_1, G_2) = \sum_{t \in \mathcal{T}_h} \lambda^{\mathrm{branch}(t)}\, \psi_t(G_1)\, \psi_t(G_2), \qquad (6)$$

we can state the following proposition, whose proof is postponed to the appendix.

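Proposition 3 (Until-N kernel computation). The until-N extension $K^{h,\text{until-N}}_{\mathrm{Branch}}$ of the branching-based
kernel of order $h$ of Definition 7 is given for the graphs $G_1$ and $G_2$ by

$$K^{h,\text{until-N}}_{\mathrm{Branch}}(G_1, G_2) = \sum_{u \in \mathcal{V}_{G_1}} \sum_{v \in \mathcal{V}_{G_2}} k_h(u, v),$$

where $k_n$, $n = 1, \ldots, h$ is defined recursively by

$$\begin{cases}
k_1(u, v) = \mathbb{1}(l(u) = l(v)), \\
k_n(u, v) = \mathbb{1}(l(u) = l(v)) \left(1 + \displaystyle\sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \lambda\, k_{n-1}(u', v')\right), \quad n = 2, \ldots, h.
\end{cases}$$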
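The only change with respect to the branching-based recursion is the additional constant 1, which lets a pattern stop growing at any node. A minimal Python sketch of this recursion follows, under the same assumptions as before: graphs stored as `(v_label, e_label)` dictionaries, non-empty matchings enumerated once each, and the factor $\frac{1}{\lambda} \prod \lambda$ written as $\lambda^{|R|-1}$. The names are hypothetical.

```python
from itertools import combinations, permutations
from math import prod

def until_n_kernel(g1, g2, h, lam):
    """Sketch of the until-N recursion for trees of depth up to h."""
    (vl1, el1), (vl2, el2) = g1, g2
    nb1, nb2 = {}, {}
    for (a, b) in el1:
        nb1.setdefault(a, []).append(b)
    for (a, b) in el2:
        nb2.setdefault(a, []).append(b)

    def matchings(u, v):
        """Yield each non-empty neighborhood matching R of (u, v) once."""
        nu, nv = nb1.get(u, []), nb2.get(v, [])
        for n_pairs in range(1, min(len(nu), len(nv)) + 1):
            for left in combinations(nu, n_pairs):
                for right in permutations(nv, n_pairs):
                    if all(vl1[a] == vl2[b] and el1[(u, a)] == el2[(v, b)]
                           for a, b in zip(left, right)):
                        yield list(zip(left, right))

    k = {(u, v): float(vl1[u] == vl2[v]) for u in vl1 for v in vl2}
    for _ in range(2, h + 1):
        # The right-hand side reads the previous k; it is rebound afterwards.
        k = {(u, v): float(vl1[u] == vl2[v]) * (1.0 + sum(
                 lam ** (len(R) - 1) * prod(k[p] for p in R)
                 for R in matchings(u, v)))
             for u in vl1 for v in vl2}
    return sum(k.values())

# Toy graph C-C-O, with one pair of opposite edges per bond.
G = ({0: "C", 1: "C", 2: "O"},
     {(0, 1): "s", (1, 0): "s", (1, 2): "s", (2, 1): "s"})
print(until_n_kernel(G, G, 2, 0.5))  # 11.5
print(until_n_kernel(G, G, 3, 0.0))  # 19.0
```

On this toy graph the numbers of matching walk pairs of length 0, 1 and 2 are 5, 6 and 8, so the value 19 obtained for $\lambda = 0$ and $h = 3$ is their sum.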

The computation given in Proposition 3 follows that of Proposition 2, and this until-N extension
comes at no extra cost. The feature space corresponding to this extended kernel has nevertheless a
much larger dimensionality than that of the original branching-based kernel. Actually, because the set
of trees $\mathcal{T}_h$ includes the set of balanced trees $\mathcal{B}_h$ as a special case, the feature space associated with the
branching-based kernel is a sub-space of the feature space associated with its until-N extension. Figure 6
illustrates the different mappings. The behavior of this kernel with respect to $\lambda$ follows that of the original
branching-based kernel. In particular, when $\lambda$ tends to zero, the set of tree-patterns with non-vanishing
weights reduces to linear chains of vertices and the kernel boils down to a kernel based on the detection
of common walks of length up to $h - 1$. More formally, one can easily check that, in this case:

$$\lim_{\lambda \to 0} K^{h,\text{until-N}}_{\mathrm{Branch}}(G_1, G_2) = \sum_{n=0}^{h-1} K^n_{\mathrm{Walk}}(G_1, G_2),$$

where $K^n_{\mathrm{Walk}}$ is the kernel based on the detection of common walks of length $n$, defined in Section 4.1,
Equation 3.

Figure 6: A graph $G$, and the set of balanced trees of order 3 (left) and general trees of depth up to
3 (right) for which a tree-pattern rooted in the dashed vertex is found in $G$, together with their kernel
weighting $\lambda^{\mathrm{branch}(t)}$.
Finally, we note that this extension is not directly applicable to the size-based kernel of Definition 6
because of a slight difference in the computations of Propositions 1 and 2. Indeed, note from Proposition
1 that in order to get the $\lambda^{|t| - h}$ weighting of the tree $t$ proposed in Definition 6, the size-based kernel
is initially computed from patterns weighted by their sizes, and is subsequently normalized by a factor
$\lambda^{-h}$. As a result, while the above extension would still have the effect of extending the feature space to
the space indexed by trees of $\mathcal{T}_h$, this $\lambda^{-h}$ normalization would affect every tree-pattern regardless of
its size, and the pattern weighting proposed in Definition 6 would be lost.

5.2 Removing tottering tree-patterns 

The DP algorithms of Sections 4.2 and 5.1 enumerate balanced tree-patterns of order $h$ through the
recursive extension of balanced tree-patterns of order 2 defined by neighborhood matching sets of pairs
of vertices. According to Definition 8, the whole sets of neighbors of a pair of vertices enter in the
definition of their neighborhood matching sets. As a result, it can be the case in a tree-pattern that a
vertex appears simultaneously as the parent and a child of a second vertex. This phenomenon is the tree
counterpart of a phenomenon observed in the context of walk-based graph kernels, where a random walk
under extension could return to a visited vertex just after leaving it. This behavior was called tottering in
Mahé et al. (2005), and following this terminology, we refer to a tree-pattern in which a vertex appears
simultaneously as the parent and a child of a second vertex as a tottering tree-pattern. Figure 7 illustrates
the tottering phenomenon.





Figure 7: Left: tottering (red) and no-tottering (blue) walks. Right: tottering (red) and no-tottering (blue)
tree-patterns.



In many cases these tree-patterns are likely to be uninformative features. In particular they are not
proper subgraphs of the initial graphs. Even worse, the ratio of the number of tottering tree-patterns
over the number of non-tottering tree-patterns quickly increases with the depth $h$ of the trees, suggesting
that informative patterns corresponding to deep trees might be hidden by the profusion of tottering
tree-patterns. In order to tackle this issue we now adapt an idea of Mahé et al. (2005) to filter out these
spurious tottering tree-patterns in the kernels presented above. Tottering can be prevented
by adding constraints in the tree-pattern counting function, according to the following definition.

Definition 9 (No-tottering tree-pattern counting function). From the tree-pattern counting function
introduced earlier, a no-tottering tree-pattern counting function can be defined for the tree $t = (V_t, E_t)$, with
$V_t = (n_1, \ldots, n_{|t|})$, and the graph $G = (V_G, E_G)$, with $V_G = (v_1, \ldots, v_{|V_G|})$, as

$$\psi_t^{NT}(G) = \Big| \Big\{ (\alpha_1, \ldots, \alpha_{|t|}) \in [1, |V_G|]^{|t|} : (v_{\alpha_1}, \ldots, v_{\alpha_{|t|}}) = \mathrm{pattern}(t) \;\wedge\; \big( (n_i, n_j), (n_j, n_k) \in E_t \Rightarrow \alpha_i \neq \alpha_k \big) \Big\} \Big|.$$
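
The additional constraint of Definition 9 can be illustrated on a small example. The following Python sketch is only an illustration and not the implementation used in the paper; the function name `is_tottering` and the encoding of a tree as a list of (parent, child) index pairs are assumptions made for this example. It checks whether an assignment $\alpha$ of tree nodes to graph vertices maps a node and one of its grandchildren to the same graph vertex.

```python
def is_tottering(tree_edges, alpha):
    """Return True if the assignment `alpha` violates the constraint of Definition 9.

    tree_edges: list of directed (parent, child) pairs over tree node indices
                (hypothetical encoding of E_t).
    alpha:      dict mapping each tree node index to a graph vertex index.
    """
    children = {}
    for parent, child in tree_edges:
        children.setdefault(parent, []).append(child)
    for n_i, n_j in tree_edges:
        for n_k in children.get(n_j, []):
            # (n_i, n_j) and (n_j, n_k) are both tree edges, so the
            # no-tottering constraint requires alpha[n_i] != alpha[n_k].
            if alpha[n_i] == alpha[n_k]:
                return True
    return False


# A chain of three tree nodes: 0 -> 1 -> 2.
chain = [(0, 1), (1, 2)]
print(is_tottering(chain, {0: 5, 1: 7, 2: 5}))  # True: vertex 5 is parent and child of vertex 7
print(is_tottering(chain, {0: 5, 1: 7, 2: 9}))  # False
```

Only assignments for which this check returns False would contribute to the no-tottering count.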

Following Definition 3, a graph kernel based on no-tottering tree-patterns can be defined from this
no-tottering tree-pattern counting function.

Definition 10 (No-tottering tree-pattern kernel). A graph kernel $K^{NT}$ based on no-tottering
tree-patterns is given for the graphs $G_1$ and $G_2$ by

$$K^{NT}(G_1, G_2) = \sum_{t \in \mathcal{T}} w(t)\, \psi_t^{NT}(G_1)\, \psi_t^{NT}(G_2), \qquad (7)$$

where $\mathcal{T}$ is a set of trees, $w : \mathcal{T} \to \mathbb{R}$ is a tree weighting functional and $\psi_t^{NT}$ is the no-tottering
tree-pattern counting function of Definition 9.

This latter definition therefore extends the tree-pattern kernel of Definition 5 to the no-tottering case.
However, due to the additional constraints on the set of acceptable patterns, the DP framework based on
neighborhood matching sets described in Sections 4.2 and 5.1 does not hold any longer. In Mahé et al.
(2005), the following graph transformation was introduced in order to filter tottering walks.

Definition 11 (Graph transformation). For a graph $G = (V_G, E_G)$, we let its transformed graph
$G' = (V_{G'}, E_{G'})$ be defined by:

• $V_{G'} = V_G \cup E_G$,

• $E_{G'} = \{ (v, (v, t)) \mid v \in V_G, (v, t) \in E_G \} \cup \{ ((u, v), (v, t)) \mid (u, v), (v, t) \in E_G, u \neq t \}$,

and labeled as follows:

• for a node $v' \in V_{G'}$, the label is either $l'(v') = l(v')$ if $v' \in V_G$, or $l'(v') = l(v)$ if $v' = (u, v) \in E_G$,

• for an edge $e' = (v'_1, v'_2)$ between two vertices $v'_1 \in V_G \cup E_G$ and $v'_2 \in E_G$, the label is simply
given by $l'(e') = l(v'_2)$.
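
The construction of Definition 11 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ChemCpp implementation: the function name `transform_graph` and the dictionary-based graph encoding are hypothetical, and each chemical bond is assumed to be stored as two directed edges, one per direction.

```python
def transform_graph(vertex_labels, edge_labels):
    """Build the transformed graph G' of Definition 11.

    vertex_labels: dict vertex -> label (vertices must not themselves be tuples).
    edge_labels:   dict (u, v) -> label over directed edges of G.
    Returns (vertex_labels_t, edge_labels_t) for G'.
    """
    # V_G' = V_G union E_G; an edge-vertex (u, v) inherits the label of v.
    vertex_labels_t = dict(vertex_labels)
    for (u, v) in edge_labels:
        vertex_labels_t[(u, v)] = vertex_labels[v]

    edge_labels_t = {}
    # First family of edges: v -> (v, t).
    for (v, t), label in edge_labels.items():
        edge_labels_t[(v, (v, t))] = label
    # Second family: (u, v) -> (v, t) with u != t, which forbids going back to u.
    for (u, v) in edge_labels:
        for (w, t), label in edge_labels.items():
            if w == v and t != u:
                edge_labels_t[((u, v), (v, t))] = label
    return vertex_labels_t, edge_labels_t


# Toy chain C-C-O with single bonds, each bond stored in both directions.
vertices = {0: "C", 1: "C", 2: "O"}
edges = {(0, 1): "single", (1, 0): "single", (1, 2): "single", (2, 1): "single"}
v_t, e_t = transform_graph(vertices, edges)
print(len(v_t), len(e_t))  # 7 6
print(v_t[(1, 2)])         # O
```

In this toy example the transformed graph has $|V_G| + |E_G| = 3 + 4 = 7$ vertices, and no edge of the form $((u, v), (v, u))$ is created, which is what removes tottering.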

This graph transformation is illustrated in Figure 8 for the graph corresponding to the chemical
compound of Figure 2. Based on this graph transformation, Mahé et al. (2005) proved that there is a
bijection between the set of no-tottering walks of a graph and the set of walks of its transformed graph
that start on a vertex corresponding to a vertex of the original graph. In a similar way, we show below
that there is a bijection between the set of no-tottering tree-patterns found in a graph and the set of
tree-patterns found in its transformed graph rooted in a vertex corresponding to a vertex of the original graph.
This is summarized in the following proposition, whose proof is postponed to Appendix C.

Figure 8: The graph transformation. I) The original molecule. II) The corresponding graph $G = (V_G, E_G)$.
III) The transformed graph. IV) The labels on the transformed graph. Note that different
widths stand for different edge labels, and gray nodes are the nodes belonging to $V_G$.

Proposition 4. If we let $G'_1$ (resp. $G'_2$) be the transformed graph of $G_1$ (resp. $G_2$), the no-tottering
tree-pattern kernel of Definition 10 is given by

$$K^{NT}(G_1, G_2) = \sum_{t \in \mathcal{T}} w(t)\, \psi_t^{NT}(G_1)\, \psi_t^{NT}(G_2) = \sum_{t \in \mathcal{T}} w(t)\, \psi_t^{(V_{G_1})}(G'_1)\, \psi_t^{(V_{G_2})}(G'_2),$$

where, if $G'$ is the transformed graph of $G$ given by Definition 11, $V_G \subset V_{G'}$ is the set of vertices of $G'$
corresponding to the vertices of $G$, and $\psi_t^{(v_1, \ldots, v_n)}(G) = \sum_{i=1}^{n} \psi_t^{(v_i)}(G)$.

This proposition shows that we can compute no-tottering extensions of the kernels of Definitions 6
and 7, and of the until-N kernel extension of Equation 6, using the graph transformation of Definition 11
and the original DP algorithms of Sections 4.2 and 5.1. However, this operation comes at the expense of
an increase in the cost of computing the kernel. More precisely, by definition of the graph transformation,
we have $|V_{G'}| = |V_G| + |E_G|$. Moreover, as noticed by Mahé et al. (2005), the maximum out-degree of
the vertices of the transformed graph is equal to that of the original graph. As a result, the worst-case
complexity of evaluating the functional $k_h(u, v)$ of Propositions 1, 2 and 3 is the same whether $u$ and $v$ belong
to $V_{G_1}$ and $V_{G_2}$, or to $V_{G'_1}$ and $V_{G'_2}$. It follows that for the graphs $G_1$ and $G_2$ we have

$$O\big(K^{NT}(G_1, G_2)\big) = \frac{(|V_{G_1}| + |E_{G_1}|)\,(|V_{G_2}| + |E_{G_2}|)}{|V_{G_1}|\,|V_{G_2}|}\; O\big(K(G_1, G_2)\big),$$

where $K$ is one of the original kernels and $K^{NT}$ is its no-tottering extension of
Definition 10.
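
The walk bijection of Mahé et al. (2005) recalled above, and the size factor entering the complexity bound, can be checked numerically on a toy graph. The Python sketch below is illustrative only: the helper names (`successors`, `count_walks`, `transformed_edges`) are hypothetical, labels are ignored, and each undirected bond is assumed to be stored as two directed edges.

```python
def successors(edges):
    """Map each vertex to the list of its out-neighbors."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    return succ


def count_walks(edges, starts, n, no_tottering=False):
    """Count walks made of n vertices that start from a vertex of `starts`."""
    succ = successors(edges)

    def extend(prev, cur, remaining):
        if remaining == 0:
            return 1
        total = 0
        for nxt in succ.get(cur, []):
            if no_tottering and nxt == prev:
                continue  # forbid returning to the vertex just left
            total += extend(cur, nxt, remaining - 1)
        return total

    return sum(extend(None, s, n - 1) for s in starts)


def transformed_edges(edges):
    """Edge set of the transformed graph of Definition 11 (labels omitted)."""
    out = [(v, (v, t)) for (v, t) in edges]
    out += [((u, v), (w, t)) for (u, v) in edges for (w, t) in edges
            if w == v and t != u]
    return out


# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
bonds = [(0, 1), (1, 2), (2, 0), (2, 3)]
edges = bonds + [(v, u) for (u, v) in bonds]  # both directions
vertices = [0, 1, 2, 3]
edges_t = transformed_edges(edges)

# No-tottering walks of G versus walks of G' started from the vertices of G.
for n in range(1, 7):
    assert (count_walks(edges, vertices, n, no_tottering=True)
            == count_walks(edges_t, vertices, n))

# Size factor of the complexity bound when G_1 = G_2 = G.
size_g = len(vertices)
size_gt = len(vertices) + len(edges)
print((size_gt * size_gt) / (size_g * size_g))  # 9.0
```

On this example the transformed graph has 12 vertices instead of 4, so the bound predicts a cost increase by a factor of at most 9 when a graph is compared with itself.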



6 Experiments 

We now turn to the experimental section. The problem we consider is a binary classification task
consisting in discriminating toxic from non-toxic molecules. Our main goal is to assess the relevance of
tree-pattern graph kernels over their walk-based counterparts for this type of chemical application. To
do so, recall from Section 4.1 that in the proposed kernels, the influence of the tree-patterns is controlled
by the parameter $\lambda$. When $\lambda$ tends to zero, the kernels converge to kernels based on the count of
common walks in the graphs (Gärtner et al., 2003). For increasing $\lambda$, tree-patterns of increasing complexity
are taken into account with increasing weight in the kernels. One can therefore study the relevance of
tree-patterns by studying how the performance of the kernels evolves with $\lambda > 0$, and checking whether
it improves over their walk-based counterpart obtained for $\lambda = 0$.

The first step towards this goal is to evaluate the kernels of Definitions 6 and 7, and therefore the
original formulation presented in Ramon and Gärtner (2003). In a second step, we want to validate the
extensions to these kernels proposed in Sections 5.1 and 5.2. On the one hand we will compare the results
obtained with the until-N extension (6) of the branching-based kernel to its initial formulation, and
on the other hand we will compare the results obtained with the no-tottering extensions (7) of the
size-based, branching-based, and until-N branching-based kernels to their original formulations. Because our
interest here is to get insights about the behavior of the different kernels, we report experimental results
for varying values of the parameters entering their definition, namely the order $h$ of the patterns, and the
pattern weighting parameter $\lambda$. In real-world applications one should of course design a procedure to
select the best parameters from the data.

The classification experiments described below were carried out with a support vector machine based
on the different kernels tested. Each kernel was implemented in C++ within the open-source ChemCpp
toolbox, and we used the open-source Python machine learning package PyML (available at
http://pyml.sourceforge.net) to perform SVM classification. The SVM prediction is obtained by taking
the sign of a score function. However, by varying
this zero decision threshold, it is possible to compute the evolution of the true positive rate versus the
false positive rate in a curve known as the Receiver Operating Characteristic (ROC) curve. The area
under this curve, known as AUC for Area Under the ROC Curve, is often considered to be a safer indicator
of the quality of a classifier than its accuracy, being 1 for an ideal classifier and 0.5 for
a random classifier. The results presented below are averaged AUC values obtained for 10 repetitions of
a 5-fold cross-validation process. Within each cross-validation fold, the "C" soft-margin parameter of
the SVM was optimized over a grid ranging from $10^{-3}$ to $10^{3}$, using an internal cross-validation method
implemented in PyML.
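
To make the AUC criterion concrete, the following Python sketch computes it directly from classifier scores. It is a generic illustration and does not reproduce the PyML routine used in the experiments; the function name `auc` is an assumption. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, ties counting for one half.

```python
def auc(scores, labels):
    """Area under the ROC curve computed from pairwise comparisons.

    scores: list of real-valued classifier outputs (e.g. SVM scores).
    labels: list of class labels, 1 for positive, anything else for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count for one half
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1]))  # 1.0, ideal ranking
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, -1, -1]))  # 0.75
print(auc([0.5, 0.5, 0.5, 0.5], [1, 1, -1, -1]))  # 0.5, uninformative scores
```

Because the value only depends on the ranking of the scores, it is unaffected by the choice of decision threshold, unlike the accuracy.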

We considered two public datasets of chemical compounds in our experiments. Both gather results
of mutagenicity assays, and while the first one (King et al., 1996) is a standard benchmark for evaluating
chemical compound classification, the second one (Helma et al., 2004) was introduced more recently.
The first dataset contains 188 chemical compounds tested for mutagenicity on Salmonella typhimurium.
The molecules of this dataset belong to the family of aromatic and hetero-aromatic nitro compounds,
and they are split into two classes: 125 positive examples with high mutagenic activity (positive levels
of log mutagenicity), and 63 negative examples with no or low mutagenic activity. The second dataset
consists of 684 compounds classified as mutagens or non-mutagens according to a test known
as the Salmonella/microsome assay. This dataset is well balanced, with 341 mutagens for
343 non-mutagens. Note that although the biological property to be predicted is the same, the
two datasets are fundamentally different. While King et al. (1996) focused on a particular family of
molecules, the second dataset involves a set of very diverse chemical compounds, qualified as noncongeneric
in the original paper. To predict mutagenicity, the model therefore needs to solve different tasks: in the
first case it has to detect subtle differences between homogeneous structures, while in the second case it
must seek regular patterns within a set of structurally different molecules.

6.1 First Dataset

Tree-patterns Vs walk-patterns:

Figure 9 shows the results obtained for the size-based (left) and branching-based (right) kernels of
Definitions 6 and 7. Each curve represents the evolution, for $0 \le \lambda \le 1$, of the AUC obtained from patterns
of a given order $h$ taken between 2 and 10.

Because the corresponding AUC values start by increasing with $\lambda$, we can note from Figure 9 (left)
that the introduction of tree-patterns is beneficial to the size-based kernel for patterns of order greater
than 2. In the case of the branching-based kernel, Figure 9 (right) suggests that this is only true for
patterns of order greater than 2 and smaller than 6, but Figure 10 shows that, based on smaller values
of $\lambda$, this is still the case for patterns up to order 7. Taken together, Figures 9 and 10 show that the
optimal AUC values obtained with the size- and branching-based kernels for patterns of order 2 to 7
are globally similar. Interestingly however, the corresponding $\lambda$ values are systematically smaller in the
case of the branching-based kernel. This is due to the fact that, as noted in Section 4.1, the size-based
penalization is stronger than the branching-based penalization. As a result, optimal $\lambda$ values observed
using the size-based kernel are shifted towards zero using the branching-based kernel.

We can also note from Figures 9 and 10 that optimal values of $\lambda$ tend to decrease for increasing
$h$. This is probably due to the fact that the number of tree-patterns increases exponentially with $h$,
and, as a result, the kernels need to limit their individual influence. Actually, we observe that higher
order patterns, with $h > 7$, can only be considered for sufficiently small values of $\lambda$. For example,
we note that the size-based kernel computation does not converge if we consider patterns of order 10
and $\lambda$ greater than 0.15. In the case of the branching-based kernel, due to the weaker pattern penalization,
this phenomenon is even more pronounced, and in that case, $10^{-4}$ is the largest acceptable value for $\lambda$. This
difference in the way the patterns are penalized probably explains the fact that while a slight improvement
over the walk-based kernel can be observed in the case of the size-based kernel when $h$ is greater than 7
(Figure 9, left), the performance systematically decreases with the branching-based kernel (Fig

Additionally, we note that because the size- and branching-based penalization of balanced trees of
order 2 is the same, the results obtained for $h = 2$ are identical with the two kernels. Surprisingly
however, no improvement over the walk-based baseline is observed, which suggests that in this case, the
tree-patterns do not bring additional information to that contained in the walk features, which consist here
of simple pairs of connected atoms.

In conclusion, these experiments demonstrate the improvement of the tree-pattern graph kernels
over their walk-based counterparts. The impact of the tree-patterns is particularly marked for patterns
of order 3 and 4, where the two kernels improve the AUC of the corresponding walk-based kernel by
more than 3%. For patterns of increasing order, this figure gradually decreases, and for patterns of order
greater than 7, it drops to 1% in the case of the size-based kernel, while no more improvement is
observed with the branching-based kernel. In both cases, optimal results are obtained for patterns of order
4, with AUC values of 95.3% and 95.0%. Finally, it is worth noting the combinatorial explosion in the
number of patterns for large orders, which in practice limits the acceptable values of $\lambda$ to small values.




Figure 9: First dataset. Evolution of the AUC with respect to $\lambda$ at different orders $h$. Left: size-based
kernel; Right: branching-based kernel.

Until-N extension: 

Figure 11 presents the results of the until-N extension (6) of the branching-based kernel. The figure
on the left-hand side, showing the evolution of the AUC for $2 \le h \le 10$ and $0 \le \lambda \le 1$, corresponds to
that on the right-hand side of Figure 9. The figure on the right-hand side plots these AUC values versus
corresponding values obtained using the original kernel.

We can first notice strong similarities between the curve in the left-hand side and its original kernel
counterpart. This is confirmed in the right-hand curve where all the points lie near the diagonal line that
represents the equivalence between the two kernels. The fact that the differences between the two kernel
formulations are barely noticeable is quite surprising since their associated feature spaces are intuitively
quite different. In Section 5.1 we mentioned that the feature space associated to the branching-based
kernel is actually a subspace of the feature space associated to its until-N extension. As a result, Figure
11 suggests that the extra features related to the until-N extension do not bring additional information into
the kernel. This hypothesis seems to be confirmed by the fact that the differences between corresponding
walk-based kernels, observed for $\lambda = 0$, are not significant either. This might be explained by the
fact that the dimensions of the corresponding feature space are probably strongly correlated due to the
relation of inclusion existing between tree and walk patterns of order $n$, and those of order $n + 1$.







Figure 10: First dataset, branching-based kernel. Evolution of the AUC at different orders $h$ for
small values of $\lambda$.



Another possible explanation for the lack of improvement of the until-N extension lies of course in the 
difficulty of learning in high dimension, suggesting that discriminating patterns of a given order are lost 
within the flood of patterns of greater orders taken into account by this until-N extension. 




Figure 11: First dataset, until-N extension. Left: evolution of the AUC with respect to $\lambda$ at different
orders $h$, for the until-N extension (6) of the branching-based kernel. Right: AUC values Vs original
AUC values.

No-tottering extension: 

Figures 12, 13 and 14 respectively show the results of the no-tottering extension (7) of the size-based,
branching-based and until-N branching-based kernels. The curves on the left-hand side show the
evolution of AUC for $2 \le h \le 10$ and $0 \le \lambda \le 1$, and the curves on the right-hand side plot these AUC
values versus corresponding values obtained using the original kernels.

If we compare the results of the no-tottering extensions of the size-based and branching-based
kernels (Figures 12 and 13), we can first note that the introduction of tree-patterns is now systematically
beneficial for $h > 2$ in both cases. Moreover, we note that the kernel computations remain feasible for
$h = 10$ and $\lambda = 1$, which means that the no-tottering extension limits the combinatorial explosion we
observed with the original formulation. While optimal results were obtained for $h = 4$ using the original
kernels, we observe that here, in both cases, the performance gradually increases from $h = 3$ to an
optimum value obtained for $h = 8$. At a given order, we note that the optimal AUC values obtained with the
two kernels are similar, and that the corresponding $\lambda$ value is smaller in the case of the branching-based
kernel, which is consistent with the observations made in the previous section. Optimal AUC values are
close to 96.5% and improve over the values around 95% observed with the initial formulation.
Importantly, we note that these optimal values are obtained using parametrizations of the kernels that lead to a
combinatorial explosion in their initial formulation. Finally, from the fact that almost all points lie above
the diagonal in the right-hand curves, we can draw the conclusion that the no-tottering extension has
almost consistently a positive influence on the classification in both cases. It is worth noting however
that, even though the introduction of no-tottering tree-patterns was shown to be beneficial, part of the
overall improvement over their tottering counterparts is due to the no-tottering extension itself, since
no-tottering walk-based kernels, observed for $\lambda = 0$, already improve significantly over their tottering
counterparts, especially for high order patterns.

We now turn to Figure 14 and the no-tottering extension (7) of the until-N branching-based kernel
(6). We can first notice that conclusions similar to those related to the no-tottering extension of the
branching-based kernel can be drawn: an improvement over the corresponding walk-based kernel is
systematically observed for tree-patterns of order greater than 2, the kernel behaves more nicely (no
combinatorial explosion), and the no-tottering extension consistently improves over the initial until-N
branching-based kernel (right-hand curve). Interestingly however, we note that optimal results obtained
for $4 < h \le 10$ tend to converge to an optimal value around 95.5% (between 95.3 and 95.9%) for a
$\lambda$ value around 0.05. While this global optimum is not as good as the overall optimal result obtained
with the no-tottering branching-based kernel (Figure 13), it still remains competitive (95.5% Vs 96.5%).
This observation contrasts with the results obtained with the until-N extension in the tottering case,
where patterns of a given order seemed to be lost in the amount of patterns of greater orders taken into
account by the kernel. This is due to the fact that the no-tottering extension limits the number of patterns
to be detected, and suggests that patterns of different orders can now be considered simultaneously in the
kernel. This fact therefore suggests that in the no-tottering case, the until-N extension can help solve
the problem of pattern order selection by taking a maximal pattern order large enough (here, $h > 4$).

6.2 Second Dataset 

In this section, we apply the same analysis to the second dataset. 

Tree-patterns Vs walk-patterns:

Figure 15 shows the results obtained with the original size-based and branching-based kernels.
Several observations are consistent with those we drew with the first dataset. First, the introduction
of tree-patterns has in both cases a positive influence on the classification, and is particularly marked
for patterns of limited order (up to a relative improvement of 12% for $h = 2$, and 4.5% for $h = 3$).
Moreover, optimal values of the $\lambda$ parameter are smaller in the case of the branching-based kernel, they
decrease for increasing $h$, and quickly lead to a combinatorial explosion for high-order patterns. Finally,
we note that, in both cases, optimal AUC values are around 84%, and are obtained for patterns of order
3 and 4, which is similar to the optimal order observed for the first dataset. However, we can note the
interesting difference that here, tree-patterns of order 2 improve dramatically the results over their walk
counterparts, which suggests that different molecular features are to be detected in both datasets.

Figure 12: First dataset. Left: evolution of the AUC with respect to $\lambda$ at different orders $h$ for the
no-tottering extension (7) of the size-based kernel. Right: no-tottering AUC values Vs original AUC
values.

Figure 13: First dataset. Left: evolution of the AUC with respect to $\lambda$ at different orders $h$ for the
no-tottering extension (7) of the branching-based kernel. Right: no-tottering AUC values Vs original
AUC values.

Figure 14: First dataset. Left: evolution of the AUC with respect to $\lambda$ at different orders $h$ for the
no-tottering extension (7) of the until-N branching-based kernel. Right: no-tottering AUC values Vs
original AUC values.




Figure 15: Second dataset. Evolution of the AUC with respect to $\lambda$ at different orders $h$. Left: size-based
kernel; Right: branching-based kernel.

Until-N extension: 

Figure 16 shows the results obtained with the until-N extension (6) of the branching-based kernel.
Here again, observations are consistent with the first dataset. In particular, we can note that the results
obtained with and without the until-N extension are very similar, and this fact is even more pronounced
here. This second piece of evidence confirms that the until-N extension is of little use in the original formulation
of the kernel, most probably because patterns of a given order are drowned within the amount of patterns
of greater orders.

No-tottering extension:

Figures 17, 18 and 19 respectively present the results of the no-tottering extension (7) of the size-based,
branching-based, and until-N branching-based kernels.

[Figure 16 panels. Left: "BRANCH-BASED UNTIL-N, TOTTERING" (x-axis: lambda). Right: "AUC(branching-based, until-N) Vs AUC(branching-based)".]

Figure 16: Second dataset, until-N extension. Left: evolution of the AUC with respect to $\lambda$ at different
orders $h$ for the until-N extension (6) of the branching-based kernel. Right: AUC values Vs original
AUC values.

Several observations are consistent with the first dataset. We can likewise note that with the
no-tottering extension, the introduction of tree-patterns is systematically beneficial in both kernels.
Moreover, at a given order, optimal results observed with the two kernels are similar, and the corresponding
$\lambda$ value is smaller with the branching-based kernel. Finally, the no-tottering extension limits the
combinatorial explosion of the kernel computation.

There is however a striking difference because results are optimal here for patterns of order 3,
patterns of order 2 rank second, and the results gradually decrease for orders greater than 3. This behavior
is exactly opposite to the one we observed with the first dataset, where results gradually increased with
the order of the patterns and were optimal for patterns of order 8. This therefore tends to confirm that
distinct features are to be detected within the two datasets, and can be explained by the fact that the
compounds are structurally similar in the first dataset, and different (or noncongeneric) in the second
one. Indeed, while the kernel needs to detect subtle differences between the compounds of the first
dataset, it must identify regular patterns within the second one, and it is not surprising that
discriminating patterns are shorter in this case. This observation supports the intuition that the choice of the order
of the patterns should be related to (or learned from) the dataset itself, as suggested in Section 4.1.
Finally, we note that the best AUC value is around 84% (corresponding to a relative improvement of
7% over the corresponding walk-based kernel), and is therefore similar to that obtained with the original
formulation of the kernel. Nevertheless, we observe from the curves on the right-hand side that contrary
to the first dataset, the no-tottering extension has a limited overall impact. This is due to the surprising
fact that here, the no-tottering extension does not seem to be beneficial by itself, since we can note that it
systematically degrades the performance of the corresponding walk-based kernels, obtained for $\lambda = 0$.
As a result, even though the introduction of tree-patterns is beneficial in both cases, better performances
can be obtained here if we consider tottering tree-patterns. Once again this behavior is opposite to that
of the first dataset. This might be explained as well by the fact that, contrary to the first dataset, the
molecules considered here are structurally different, and as a result, tottering can help find common
features between these noncongeneric compounds.

Concerning the no-tottering extension (7) of the until-N branching-based kernel (6), results presented
in Figure 19 are not clear. Indeed, in that case, the introduction of the tree-patterns only improves
the results for patterns of limited order, and for patterns of order greater than 4, results systematically
decrease. We can however note the interesting point that optimal results obtained for patterns of order
5 to 10 converge to a global optimal value between 85 and 86%. This therefore tends to confirm that
in the no-tottering case, the until-N extension can help solve the problem of pattern order selection by
considering a maximal pattern order large enough (here, $h > 4$). Nevertheless, the striking difference
with the results obtained with the first dataset is that in this case, when $h > 4$, the introduction of
tree-patterns could not further improve the results obtained by the until-N walk-based kernel, which constitute
the overall best performance we could observe for this dataset.



[Figure 17 panel titles. Left: "SIZE-BASED, NO-TOTTERING". Right: "AUC(tottering) Vs AUC(no-tottering), size-based ponderation".]

Figure 17: Second dataset. Left: evolution of the AUC with respect to $\lambda$ at different orders $h$ for the
no-tottering extension (7) of the size-based kernel. Right: no-tottering AUC values Vs original
values.

Figure 18: Second dataset. Left: evolution of the AUC with respect to $\lambda$ at different orders $h$ for the
no-tottering extension (7) of the branching-based kernel. Right: no-tottering AUC values Vs original
values.

[Figure 19 panel titles. Left: "BRANCH-BASED UNTIL-N, NO-TOTTERING". Right: "AUC(tottering) Vs AUC(no-tottering), branching-based ponderation, Until-N extension".]

Figure 19: Second dataset. Left: evolution of the AUC with respect to $\lambda$ at different orders $h$ for the
no-tottering extension (7) of the until-N branching-based kernel. Right: no-tottering AUC values Vs
original values.



7 Discussion 

This paper introduces a family of graph kernels based on the detection of common tree patterns in the
graphs. In a first step, we revisited an initial formulation presented in Ramon and Gärtner (2003), from
which we derived two kernels with explicit feature spaces and inner products. A parameter $\lambda$ enters their
definition and makes it possible to control the complexity of the features characterizing the graphs. At
the extreme, admissible tree-patterns consist of linear chains of graph vertices, and the kernels reduce
to a classical graph kernel based on the detection of common walks (Gärtner et al., 2003). Walk-based
graph kernels are therefore generalized to a wider class of kernels defined by features of increasing levels
of complexity. In a second step we introduced two modular extensions to this initial formulation. On
the one hand, the set of trees initially indexing the feature space is enriched by the set of their subtrees
with an until-N extension, leading to a wider and more general feature space. On the other hand, a
no-tottering extension prevents spurious tree-patterns from being detected, based on the notion of "tottering"
initially introduced in the context of walk-based graph kernels (Mahé et al., 2005).

In the context of chemical applications, experiments on two toxicity datasets demonstrate that the
tree-pattern graph kernels under their initial formulation improve over their walk-based counterpart.
However, while a significant improvement could be observed for relatively small patterns, experiments
revealed the difficulty of handling high order patterns. This is due to the fact that the number of
tree-patterns detected in the graphs increases exponentially with their depth, which leads to a combinatorial
explosion of the kernel computation for large patterns. For this reason, the until-N extension proved
useless in this context: patterns of a given order are drowned within the flood of patterns of greater
order, and the two kernel formulations turned out to be equivalent. With the elimination of artificial
tree-patterns, the no-tottering extension limits this combinatorial explosion, and patterns of higher order
can be considered in the kernel. This was in particular beneficial to the first dataset where optimal results
were obtained with high-order no-tottering patterns. Nevertheless, we notice that this extension is not
always beneficial, and that in some cases, artificial common patterns due to the tottering phenomenon
can help detect molecular similarity. This is in particular the case for the second dataset, and can be
explained by the fact that, in contrast to the first dataset, it consists of structurally different compounds.

The combination of the two extensions led to mixed results. For the first dataset, we observe that the
introduction of tree-patterns in this context could now improve over their walk-based counterparts for
any maximum pattern order. This suggests that the limitation of the combinatorial explosion offered
by the no-tottering extension makes it possible to combine patterns of different order in the kernel.
However, albeit close, optimal results with the until-N extension could not match the optimal
results that were obtained with no-tottering patterns of a given order. This suggests that very precise
patterns were to be detected, and that their discriminative power is reduced by the addition of other, less
predictive, patterns. For the second dataset, the combination of the two extensions led to optimal results.
In that case however, the introduction of tree-patterns was not always beneficial and these optimal results
were obtained by until-N, no-tottering walk-kernels. Finally, we can note that, when the maximum order
of the patterns considered is large enough, results obtained with the until-N extension and no-tottering
patterns tend to converge to a global optimum which is close or equal to the overall best performance
observed in both datasets.

Among the possible extensions to our work, we note that it might be relevant in the context of
chemical applications to incorporate chemical knowledge in the graph representation of the molecules.
For instance, it is well known that physico-chemical properties of atoms are related to their position
in the molecule, and as a first step in this direction, an enrichment of atom labels by their Morgan
indices led to promising results in the context of walk-based kernels (Mahé et al., 2005). However, this
particular approach is likely to have a lesser impact in this context, because the information encoded
by the Morgan indices is to some extent already incorporated in the tree-patterns. Alternatively, we
note that the kernel implementation could easily be extended in order to introduce a flexible matching
between tree-patterns based on measures of similarity between pairs of vertices and edges, following
for instance the construction of the marginalized kernel between labeled graphs (Kashima et al., 2004).
Such an extension would induce an increase in the cost of computing the kernel, but is likely to make
sense for chemical applications, where atoms of different types can exhibit similar properties.

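A Proof of Propositions 1 and 2

In Propositions 1 and 2 we want to prove that for the graphs $G_1$ and $G_2$

$$\sum_{t \in \mathcal{B}_h} w(t)\, \psi_t(G_1)\, \psi_t(G_2) = \alpha(h) \sum_{u \in V_{G_1}} \sum_{v \in V_{G_2}} k_h(u, v), \tag{8}$$

where in Proposition 1 $\alpha(h) = \lambda^{-h}$ and $w(t) = \lambda^{|t| - h}$, while in Proposition 2 $\alpha(h) = 1$ and
$w(t) = \lambda^{\mathrm{branch}(t)}$.

From Definition 4 we have $\psi_t(G) = \sum_{u \in V_G} \psi_t^{(u)}(G)$. As a result,

$$\sum_{t \in \mathcal{B}_h} w(t)\, \psi_t(G_1)\, \psi_t(G_2) = \sum_{u \in V_{G_1}} \sum_{v \in V_{G_2}} \Big( \sum_{t \in \mathcal{B}_h} w(t)\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2) \Big),$$

and in order to prove (8) we just need to prove

$$\sum_{t \in \mathcal{B}_h} w(t)\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2) = \alpha(h)\, k_h(u, v). \tag{9}$$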

A.1 Proof of Proposition 1

In order to prove Proposition 1 it follows from (9) that we just need to prove that

$$\frac{1}{\lambda^{h}}\, k_h(u, v) = \sum_{t \in \mathcal{B}_h} \lambda^{|t| - h}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2),$$

or equivalently:

$$k_h(u, v) = \sum_{t \in \mathcal{B}_h} \lambda^{|t|}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2), \tag{10}$$

where $k_h$ is defined recursively by $k_1(u, v) = \lambda \mathbf{1}(l(u) = l(v))$ and for $h > 1$:

$$k_h(u, v) = \lambda \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \prod_{(u', v') \in R} k_{h-1}(u', v'). \tag{11}$$

We prove (10) by induction on $h$. The case $h = 1$ is rather trivial. Indeed, a tree of depth one is just
a single node, and $\psi_t^{(u)}(G_1)$ is therefore equal to 1 if $l(u) = l(r(t))$, and 0 otherwise. It follows that

$$\sum_{t \in \mathcal{B}_1} \lambda^{|t|}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2) = \sum_{t \in \mathcal{B}_1} \lambda \mathbf{1}(l(r(t)) = l(u))\, \mathbf{1}(l(r(t)) = l(v)) = \lambda \mathbf{1}(l(u) = l(v)),$$

which corresponds to $k_1(u, v)$.

Let us now assume that (10) is true at order $h - 1$, and let us prove that it is then also true at order
$h > 1$. Combining the recursive definition of $k_h$ (11) with the induction hypothesis (10) at level $h - 1$
we first obtain:

$$k_h(u, v) = \lambda \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \prod_{(u', v') \in R} \sum_{t' \in \mathcal{B}_{h-1}} \lambda^{|t'|}\, \psi_{t'}^{(u')}(G_1)\, \psi_{t'}^{(v')}(G_2). \tag{12}$$

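Second, for any graph $G$, let us denote by $\mathcal{P}_n^{(u)}(G)$ the set of balanced tree-patterns of order $n$ rooted in
$u \in V_G$, and for any tree-pattern $p \in \mathcal{P}_n^{(u)}(G)$ let $t(p) \in \mathcal{B}_n$ denote the corresponding tree. With these
notations we can rewrite, for any $n \geq 1$ and $(u, v) \in G_1 \times G_2$:

$$\sum_{t \in \mathcal{B}_n} \lambda^{|t|}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2) = \sum_{p_1 \in \mathcal{P}_n^{(u)}(G_1)} \sum_{p_2 \in \mathcal{P}_n^{(v)}(G_2)} \lambda^{|t(p_1)|}\, \mathbf{1}(t(p_1) = t(p_2)). \tag{13}$$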

Indeed both sides of this equation count the number of pairs of similar tree-patterns rooted in u and v. 
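Plugging (13) into (12) we get:

$$k_h(u, v) = \lambda \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \prod_{(u', v') \in R} \sum_{p_1 \in \mathcal{P}_{h-1}^{(u')}(G_1)} \sum_{p_2 \in \mathcal{P}_{h-1}^{(v')}(G_2)} \lambda^{|t(p_1)|}\, \mathbf{1}(t(p_1) = t(p_2)). \tag{14}$$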

Now we use the fact that any tree-pattern $p$ of order $h$ can be uniquely decomposed into a tree-pattern
$p'$ of order 2 and a set of tree-patterns of order $h - 1$ rooted at the leaves of $p'$. We note that matching
two tree-patterns is equivalent to matching the tree-patterns in their decomposition, and that the sets of
leaves of tree-patterns of order 2 rooted respectively in $u$ and $v$ matching each other are exactly given
by $\mathcal{M}(u, v)$. In other words, (14) performs a summation over pairs of matching tree-patterns of depth
$h$, rooted respectively in $u$ and $v$: the corresponding pairs of patterns of order 2 are implicitly matched
by the summation over $\mathcal{M}(u, v)$ and the condition $\mathbf{1}(l(u) = l(v))$, and the subsequent pairs of patterns
$(p_1, p_2)$ of order $h - 1$ are matched by the product of conditions $\mathbf{1}(t(p_1) = t(p_2))$.

The tree-pattern $p_1$ in $G_1$ of such a matching pair of tree-patterns of order $h$ rooted in $(u, v)$
decomposes as a pattern of depth 2 rooted in $u$ with leaves in some $R \in \mathcal{M}(u, v)$, and a set of patterns
$p_1(u')$ of depth $h - 1$ rooted in the leaves $u' \in R$. By (14), to each such matching pair is associated
the weight $\lambda \times \prod_{(u', v') \in R} \lambda^{|t(p_1(u'))|}$, which is exactly equal to $\lambda^{|t(p_1)|}$ since we obviously have
$|t(p_1)| = 1 + \sum_{(u', v') \in R} |t(p_1(u'))|$. As a result, (14) can be rewritten as:

$$k_h(u, v) = \sum_{p_1 \in \mathcal{P}_h^{(u)}(G_1)} \sum_{p_2 \in \mathcal{P}_h^{(v)}(G_2)} \lambda^{|t(p_1)|}\, \mathbf{1}(t(p_1) = t(p_2)),$$

which combined with (13) proves (10). □
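
As an illustration of recursion (11), the following Python sketch evaluates $k_h(u, v)$ by memoised recursion and sums it over all vertex pairs with the factor $\lambda^{-h}$ of Proposition 1. It rests on assumptions that are not restated in this appendix: graphs are given as hypothetical adjacency dictionaries `adj` with vertex labels `lab`, edge labels are ignored, and $\mathcal{M}(u, v)$ is taken to be the set of non-empty one-to-one matchings between identically labeled neighbours of $u$ and $v$. The names `matchings` and `size_kernel` are illustrative only, and the enumeration is exponential in the vertex degrees, so the sketch is meant for small examples rather than as a reference implementation.

```python
from functools import lru_cache
from itertools import combinations, permutations


def matchings(nu, nv, lab1, lab2):
    """Yield the non-empty one-to-one matchings R between two neighbour lists,
    keeping only pairs (a, b) whose labels agree (assumed form of M(u, v))."""
    for size in range(1, min(len(nu), len(nv)) + 1):
        for us in combinations(nu, size):
            for vs in permutations(nv, size):
                if all(lab1[a] == lab2[b] for a, b in zip(us, vs)):
                    yield tuple(zip(us, vs))


def size_kernel(adj1, lab1, adj2, lab2, h, lam):
    """Sum of k_h(u, v) over all vertex pairs, scaled by lam ** (-h)."""

    @lru_cache(maxsize=None)
    def k(n, u, v):
        if lab1[u] != lab2[v]:
            return 0.0
        if n == 1:
            return lam                      # k_1(u, v) = lam * 1(l(u) = l(v))
        total = 0.0
        for R in matchings(adj1[u], adj2[v], lab1, lab2):
            prod = 1.0
            for a, b in R:
                prod *= k(n - 1, a, b)
            total += prod
        return lam * total                  # recursion (11)

    return lam ** (-h) * sum(k(h, u, v) for u in adj1 for v in adj2)


# Toy star graph: a central "C" vertex bonded to two "H" vertices.
adj = {0: [1, 2], 1: [0], 2: [0]}
lab = {0: "C", 1: "H", 2: "H"}
print(size_kernel(adj, lab, adj, lab, h=2, lam=0.5))
```

For this star graph compared with itself and $h = 2$, expanding the recursion by hand gives $8 + 2\lambda$, that is 9.0 for $\lambda = 0.5$.
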

A.2 Proof of Proposition 2

The proof of Proposition 2 is a straightforward variant of the proof of Proposition 1. By (9) we need to
show that

$$k_h(u, v) = \sum_{t \in \mathcal{B}_h} \lambda^{\mathrm{branch}(t)}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2), \tag{15}$$

where $k_h$ is defined recursively by $k_1(u, v) = \mathbf{1}(l(u) = l(v))$ and for $h > 1$:

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \lambda\, k_{h-1}(u', v'). \tag{16}$$

We proceed again by induction over $h$ to prove (15). The case $h = 1$ is easily done by checking, using
an argument similar to that of the previous proof, that (15) is one if $l(u)$ and $l(v)$ are identical, zero
otherwise, which corresponds to the definition of $k_1(u, v)$. If we assume that (15) is true at the level
$h - 1$, we can plug it in (16) to obtain:

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \sum_{t' \in \mathcal{B}_{h-1}} \lambda^{1 + \mathrm{branch}(t')}\, \psi_{t'}^{(u')}(G_1)\, \psi_{t'}^{(v')}(G_2). \tag{17}$$

We can then follow exactly the same line of proof as in the previous section and obtain the following 
equations 

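$$\sum_{t \in \mathcal{B}_n} \lambda^{\mathrm{branch}(t)}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2) = \sum_{p_1 \in \mathcal{P}_n^{(u)}(G_1)} \sum_{p_2 \in \mathcal{P}_n^{(v)}(G_2)} \lambda^{\mathrm{branch}(t(p_1))}\, \mathbf{1}(t(p_1) = t(p_2)), \tag{18}$$

and

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \sum_{p_1 \in \mathcal{P}_{h-1}^{(u')}(G_1)} \sum_{p_2 \in \mathcal{P}_{h-1}^{(v')}(G_2)} \lambda^{1 + \mathrm{branch}(t(p_1))}\, \mathbf{1}(t(p_1) = t(p_2)), \tag{19}$$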

that correspond respectively to (13) and (14). The only difference with the previous proof is in the
exponent of $\lambda$ to form the weight of a matching pair of tree-patterns. By analogy with the previous proof,
we consider the tree-pattern $p_1$ in $G_1$ of a pair of matching tree-patterns of depth $h$ rooted in $(u, v)$, that
decomposes as a pattern of depth 2 rooted in $u$ with leaves in some $R \in \mathcal{M}(u, v)$, and a set of patterns
$p_1(u')$ of depth $h - 1$ rooted in the leaves $u' \in R$. By (19), to each such matching pair is associated the
weight $\frac{1}{\lambda} \prod_{(u', v') \in R} \lambda^{1 + \mathrm{branch}(t(p_1(u')))} = \lambda^{-1 + \sum_{(u', v') \in R} (1 + \mathrm{branch}(t(p_1(u'))))}$. We observe that the number
of leaves of a tree $t$, that we note $\mathrm{leaves}(t)$, is equal to $1 + \mathrm{branch}(t)$. The weight associated to the above
pair of matching tree-patterns can therefore be written as $\lambda^{-1 + \sum_{(u', v') \in R} \mathrm{leaves}(t(p_1(u')))}$. Finally, because
the number of leaves of the tree-pattern $p_1$ is equal to the sum of the leaves of the patterns $p_1(u')$, it
follows that this expression is equal to $\lambda^{-1 + \mathrm{leaves}(t(p_1))} = \lambda^{\mathrm{branch}(t(p_1))}$. As a result, we can write (19) as

$$k_h(u, v) = \sum_{p_1 \in \mathcal{P}_h^{(u)}(G_1)} \sum_{p_2 \in \mathcal{P}_h^{(v)}(G_2)} \lambda^{\mathrm{branch}(t(p_1))}\, \mathbf{1}(t(p_1) = t(p_2)),$$

which, combined with (18), concludes the proof. □

B Proof of Proposition 3

The proof presented in this section is very similar to the proofs of Propositions 1 and 2. Based on the
observations made in the beginning of Appendix A it follows from (9) that in order to prove Proposition
3 we just need to prove that

$$k_h(u, v) = \sum_{t \in \mathcal{T}_h} \lambda^{\mathrm{branch}(t)}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2), \tag{20}$$

where $k_h$ is defined recursively by $k_1(u, v) = \mathbf{1}(l(u) = l(v))$ and for $h > 1$

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) \Big( 1 + \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \lambda\, k_{h-1}(u', v') \Big). \tag{21}$$

We proceed again by induction over $h$ to prove (20). The case $h = 1$ directly follows from the proof of
Proposition 2. If we assume that (20) is true at the level $h - 1$, we can plug it in (21) to obtain:

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) \Big( 1 + \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \sum_{t' \in \mathcal{T}_{h-1}} \lambda^{1 + \mathrm{branch}(t')}\, \psi_{t'}^{(u')}(G_1)\, \psi_{t'}^{(v')}(G_2) \Big). \tag{22}$$

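By analogy with the construction of the previous proof, for any graph $G$, let us denote by $\mathcal{P}_n^{(u)}(G)$ the
set of tree-patterns of depth 1 to $n$ rooted in $u \in V_G$, and for any tree-pattern $p \in \mathcal{P}_n^{(u)}(G)$ let $t(p) \in \mathcal{T}_n$
denote the corresponding tree. Note that $\mathcal{P}_n^{(u)}(G)$ corresponds here to general tree-patterns of depth 1
to $n$, in contrast to the balanced tree-patterns of order $n$ involved in the previous proofs. With these
notations we obtain similarly, for any $n \geq 1$ and $(u, v) \in G_1 \times G_2$:

$$\sum_{t \in \mathcal{T}_n} \lambda^{\mathrm{branch}(t)}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2) = \sum_{p_1 \in \mathcal{P}_n^{(u)}(G_1)} \sum_{p_2 \in \mathcal{P}_n^{(v)}(G_2)} \lambda^{\mathrm{branch}(t(p_1))}\, \mathbf{1}(t(p_1) = t(p_2)), \tag{23}$$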

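and, plugging (23) into (22), we get:

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) \times \Big( 1 + \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \sum_{p_1 \in \mathcal{P}_{h-1}^{(u')}(G_1)} \sum_{p_2 \in \mathcal{P}_{h-1}^{(v')}(G_2)} \lambda^{1 + \mathrm{branch}(t(p_1))}\, \mathbf{1}(t(p_1) = t(p_2)) \Big), \tag{24}$$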
which can be further decomposed into:

$$k_h(u, v) = \mathbf{1}(l(u) = l(v)) + \mathbf{1}(l(u) = l(v)) \sum_{R \in \mathcal{M}(u, v)} \frac{1}{\lambda} \prod_{(u', v') \in R} \sum_{p_1 \in \mathcal{P}_{h-1}^{(u')}(G_1)} \sum_{p_2 \in \mathcal{P}_{h-1}^{(v')}(G_2)} \lambda^{1 + \mathrm{branch}(t(p_1))}\, \mathbf{1}(t(p_1) = t(p_2)). \tag{25}$$

The second part of the right member of (25) matches pairs of tree-patterns of depth 2 to $n$ rooted in
$(u, v)$. It follows directly from the proof of Proposition 2 that such a pair $(p_1, p_2)$ of matching
tree-patterns is weighted by $\lambda^{\mathrm{branch}(t(p_1))}$. The first part of the right member of (25) matches the trivial pair
of tree-patterns of depth 1 rooted in $(u, v)$ consisting of the single nodes $(u, v)$. The corresponding tree
has a zero branching cardinality, and we can therefore write

$$\mathbf{1}(l(u) = l(v)) = \sum_{t \in \mathcal{T}_1} \lambda^{\mathrm{branch}(t)}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2).$$

Taken together, these two arguments show that (25) can be written as

$$k_h(u, v) = \sum_{t \in \mathcal{T}_h} \lambda^{\mathrm{branch}(t)}\, \psi_t^{(u)}(G_1)\, \psi_t^{(v)}(G_2),$$

which concludes the proof. □
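
The branching-based recursions (16) and (21) differ only by the additive constant 1, so both can be sketched with one Python function. As before, this is a minimal illustration under assumptions not restated in this appendix: hypothetical adjacency dictionaries `adj` and label dictionaries `lab`, no edge labels, and $\mathcal{M}(u, v)$ taken as the set of non-empty one-to-one matchings between identically labeled neighbours. The weight $\frac{1}{\lambda} \prod_{(u', v') \in R} \lambda\, k_{h-1}(u', v')$ is evaluated in the algebraically equivalent form $\lambda^{|R| - 1} \prod_{(u', v') \in R} k_{h-1}(u', v')$, which also stays well defined for $\lambda = 0$. The names `matchings` and `branch_kernel` and the flag `until_n` are illustrative.

```python
from functools import lru_cache
from itertools import combinations, permutations


def matchings(nu, nv, lab1, lab2):
    """Non-empty one-to-one matchings between identically labeled neighbours
    (assumed form of M(u, v); edge labels are ignored)."""
    for size in range(1, min(len(nu), len(nv)) + 1):
        for us in combinations(nu, size):
            for vs in permutations(nv, size):
                if all(lab1[a] == lab2[b] for a, b in zip(us, vs)):
                    yield tuple(zip(us, vs))


def branch_kernel(adj1, lab1, adj2, lab2, h, lam, until_n=False):
    """Sum over vertex pairs of k_h(u, v), using recursion (16) when
    until_n is False and recursion (21) when until_n is True."""

    @lru_cache(maxsize=None)
    def k(n, u, v):
        if lab1[u] != lab2[v]:
            return 0.0
        if n == 1:
            return 1.0                       # k_1(u, v) = 1(l(u) = l(v))
        total = 1.0 if until_n else 0.0      # the extra "1 +" of (21)
        for R in matchings(adj1[u], adj2[v], lab1, lab2):
            prod = lam ** (len(R) - 1)       # (1/lam) * lam^|R|
            for a, b in R:
                prod *= k(n - 1, a, b)
            total += prod
        return total

    return sum(k(h, u, v) for u in adj1 for v in adj2)


# Toy star graph: a central "C" vertex bonded to two "H" vertices.
adj = {0: [1, 2], 1: [0], 2: [0]}
lab = {0: "C", 1: "H", 2: "H"}
print(branch_kernel(adj, lab, adj, lab, h=2, lam=0.5))
print(branch_kernel(adj, lab, adj, lab, h=2, lam=0.5, until_n=True))
print(branch_kernel(adj, lab, adj, lab, h=2, lam=0.0))
```

With $\lambda = 0$ only matchings of size one contribute, so the value reduces to a count of pairs of identically labeled walks, which is the walk-kernel end of the interpolation controlled by $\lambda$.
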

C Proof of Proposition 4

The proof is derived from results presented in Mahé et al. (2005). The sets of walks and no-tottering
walks of the graph $G = (V_G, \mathcal{E}_G)$ are respectively defined by $\mathcal{W}(G) = \bigcup_{n=0}^{\infty} \mathcal{W}_n(G)$ and
$\mathcal{W}^{NT}(G) = \bigcup_{n=0}^{\infty} \mathcal{W}_n^{NT}(G)$, where

$$\mathcal{W}_n(G) = \{(v_0, \ldots, v_n) \in V_G^{n+1} : (v_i, v_{i+1}) \in \mathcal{E}_G,\ 0 \leq i \leq n - 1\}$$

is the set of walks of length $n$ defined in Section 4.1 and

$$\mathcal{W}_n^{NT}(G) = \{(v_0, \ldots, v_n) \in \mathcal{W}_n(G) : v_i \neq v_{i+2},\ 0 \leq i \leq n - 2\}$$

is the set of no-tottering walks of length $n$ defined in Mahé et al. (2005). We start by stating the following
lemma.

Lemma 1. A tree-pattern $p$ of the graph $G$ associated to the tree $t$ is no-tottering if, and only if, any
walk of $G$ defined as a succession of vertices of $p$ corresponding to nodes of $t$ forming a path from its
root to one of its leaves is no-tottering.

Proof of Lemma 1. According to Definition 9, let $(v_1, \ldots, v_{|t|}) \in V_G^{|t|}$ be a no-tottering tree-pattern
of the graph $G = (V_G, \mathcal{E}_G)$ corresponding to the tree $t = (V_t, \mathcal{E}_t)$, where $V_t = (n_1, \ldots, n_{|t|})$. Let
$(n_{i_0}, \ldots, n_{i_k}) \in V_t^{k+1}$ be a path from the root of $t$ to one of its leaves. By Definition 3 it is clear that
$(v_{i_0}, \ldots, v_{i_k}) \in \mathcal{W}(G)$. Moreover, by the definition of paths we have $(n_{i_m}, n_{i_{m+1}}), (n_{i_{m+1}}, n_{i_{m+2}}) \in \mathcal{E}_t$
for $0 \leq m \leq k - 2$. By Definition 9 this implies that $v_{i_m} \neq v_{i_{m+2}}$ for $0 \leq m \leq k - 2$, meaning that
$(v_{i_0}, \ldots, v_{i_k}) \in \mathcal{W}^{NT}(G)$. Conversely, let $p \in V_G^{|t|}$ be a tree-pattern of the graph $G = (V_G, \mathcal{E}_G)$
corresponding to the tree $t = (V_t, \mathcal{E}_t)$. Consider the set of walks of $G$ defined as successions of vertices
of $p$ associated to nodes of $t$ forming paths from its root to its leaves. If these walks are not tottering, it
is clear from Definition 9 that the tree-pattern itself is not tottering. □
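
The walk sets defined at the start of this appendix can be enumerated directly on small graphs. The following Python sketch is a brute-force illustration, assuming a graph stored as a hypothetical adjacency dictionary `adj`; the function names `walks` and `no_tottering_walks` are illustrative, and the number of walks grows exponentially with the length $n$.

```python
def walks(adj, n):
    """All walks (v_0, ..., v_n) of length n, i.e. with n consecutive edges."""
    result = [(v,) for v in adj]
    for _ in range(n):
        result = [w + (t,) for w in result for t in adj[w[-1]]]
    return result


def no_tottering_walks(adj, n):
    """Walks of length n such that v_i != v_{i+2} for 0 <= i <= n - 2."""
    return [w for w in walks(adj, n)
            if all(w[i] != w[i + 2] for i in range(n - 1))]


# Path graph 0 - 1 - 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(len(walks(adj, 2)))                  # 6 walks of length 2
print(sorted(no_tottering_walks(adj, 2)))  # [(0, 1, 2), (2, 1, 0)]
```

On the path graph, four of the six walks of length 2 return immediately to their starting vertex (for example 0, 1, 0), and only the two traversals of the path are no-tottering.
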

We can now state the proof of Proposition 4.

Proof of Proposition 4. If, following the definition of the graph transformation, we let $G'$ be the
transformed graph of $G$, Mahé et al. (2005) showed that there is a bijection between $\mathcal{W}^{NT}(G)$ and the
set of walks of $G'$ starting in a vertex corresponding to a vertex of $G$, which can be formally defined as

$$\mathcal{W}^{\{V_G\}}(G') = \{(v_0, \ldots, v_n) \in \mathcal{W}(G') : v_0 \in V_G,\ n \in \mathbb{N}\},$$

if we let $V_G \subset V_{G'}$ be the subset of $V_{G'}$ that corresponds to $V_G$. It follows from Lemma 1 that there is a
bijection between the set of no-tottering tree-patterns of $G$ and the set of tree-patterns of $G'$ rooted in a
vertex of $V_G$. Finally, Mahé et al. (2005) showed that a walk in $\mathcal{W}^{NT}(G)$ and its image in $\mathcal{W}^{\{V_G\}}(G')$
are identically labeled, which makes it possible to count no-tottering labeled walks in $G$ by counting identically
labeled walks in $G'$ starting in a vertex of $V_G$. It follows that counting no-tottering tree-patterns in $G$
is equivalent to counting tree-patterns in $G'$ rooted in a vertex of $V_G$. As a result, we have
$\psi_t^{NT}(G) = \sum_{u \in V_G} \psi_t^{(u)}(G')$, which concludes the proof. □
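
The bijection used in this proof can be checked numerically on a small graph. The Python sketch below relies on an assumption that should be verified against the definition of the transformed graph in the main text, since it is not restated in this appendix: $G'$ is assumed to contain the vertices of $G$ plus one vertex per directed edge $(u, v)$ of $G$, with an edge from $u$ to $(u, v)$ and an edge from $(u, v)$ to $(v, t)$ whenever $t \neq u$. Labels are left out, and the function names are illustrative. Under this assumption, walks of length $n$ in $G'$ that start in an original vertex should be as numerous as no-tottering walks of length $n$ in $G$.

```python
def transform(adj):
    """Assumed form of the transformed graph G' (structure only, no labels).
    Original vertices must not themselves be tuples, to avoid key clashes."""
    new_adj = {u: [(u, v) for v in adj[u]] for u in adj}
    for u in adj:
        for v in adj[u]:
            # From the directed edge (u, v), continue to (v, t) unless t == u.
            new_adj[(u, v)] = [(v, t) for t in adj[v] if t != u]
    return new_adj


def count_walks_from(adj, starts, n):
    """Number of walks of length n starting in one of the given vertices."""
    counts = {v: 1 for v in starts}
    for _ in range(n):
        nxt = {}
        for v, c in counts.items():
            for t in adj[v]:
                nxt[t] = nxt.get(t, 0) + c
        counts = nxt
    return sum(counts.values())


def count_no_tottering(adj, n):
    """Brute-force count of walks of length n with v_i != v_{i+2}."""
    ws = [(v,) for v in adj]
    for _ in range(n):
        ws = [w + (t,) for w in ws for t in adj[w[-1]]]
    return sum(all(w[i] != w[i + 2] for i in range(n - 1)) for w in ws)


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
g_prime = transform(adj)
for n in range(5):
    print(n, count_walks_from(g_prime, list(adj), n), count_no_tottering(adj, n))
```

The two counts printed on each line are expected to agree, because a walk $v_0, (v_0, v_1), (v_1, v_2), \ldots$ in the assumed $G'$ can only be extended by an edge that does not return to the vertex visited two steps earlier.
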